Zipformer-transducer-based Models

Hint

Please refer to Installation to install sherpa-onnx before you read this section.

sherpa-onnx-zipformer-ru-2024-09-18 (Russian, 俄语)

This model is from https://huggingface.co/alphacep/vosk-model-ru/tree/main.

You can find the export script at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/export-russian-onnx-models.yaml

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2
tar xvf sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2
rm sherpa-onnx-zipformer-ru-2024-09-18.tar.bz2

You should see something like the following after downloading:

ls -lh sherpa-onnx-zipformer-ru-2024-09-18
total 700352
-rw-r--r--  1 fangjun  staff   240K Sep 18 12:01 bpe.model
-rw-r--r--  1 fangjun  staff   1.2M Sep 18 12:01 decoder.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Sep 18 12:01 decoder.onnx
-rw-r--r--  1 fangjun  staff    65M Sep 18 12:01 encoder.int8.onnx
-rw-r--r--  1 fangjun  staff   247M Sep 18 12:01 encoder.onnx
-rw-r--r--  1 fangjun  staff   253K Sep 18 12:01 joiner.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Sep 18 12:01 joiner.onnx
drwxr-xr-x  4 fangjun  staff   128B Sep 18 12:01 test_wavs
-rw-r--r--  1 fangjun  staff   6.2K Sep 18 12:01 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.
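
If you are not sure whether a file meets these requirements, you can inspect it with Python's standard wave module before decoding. This is only a quick check and is not required by sherpa-onnx; the path below is just an example.

import wave

# Replace the path with your own wave file.
filename = "./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav"

with wave.open(filename, "rb") as f:
    print("channels     :", f.getnchannels())  # must be 1
    print("sample width :", f.getsampwidth())  # must be 2, i.e., 16-bit samples
    print("sample rate  :", f.getframerate())  # any sampling rate is fine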

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.onnx \
  --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.onnx \
  --num-threads=1 \
  ./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.onnx --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.onnx --num-threads=1 ./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-ru-2024-09-18/encoder.onnx", decoder_filename="./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx", joiner_filename="./sherpa-onnx-zipformer-ru-2024-09-18/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav
{"lang": "", "emotion": "", "event": "", "text": " родион потапыч высчитывал каждый новый вершок углубления и давно определил про себя", "timestamps": [0.00, 0.16, 0.28, 0.52, 0.68, 0.84, 0.96, 1.12, 1.44, 1.64, 1.76, 1.92, 2.08, 2.16, 2.36, 2.48, 2.60, 2.80, 2.96, 3.04, 3.20, 3.40, 3.44, 3.56, 3.68, 3.80, 3.88, 4.00, 4.16, 4.20, 4.64, 4.88, 5.08, 5.20, 5.44, 5.64, 5.68, 5.92, 6.32, 6.56], "tokens":[" ро", "ди", "он", " по", "та", "п", "ы", "ч", " вы", "с", "чи", "ты", "ва", "л", " ка", "жд", "ый", " но", "в", "ый", " вер", "ш", "о", "к", " у", "г", "лу", "б", "л", "ения", " и", " да", "в", "но", " оп", "ре", "дел", "ил", " про", " себя"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.336 s
Real time factor (RTF): 0.336 / 7.080 = 0.047

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx \
  --num-threads=1 \
  ./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx --num-threads=1 ./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx", joiner_filename="./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-ru-2024-09-18/test_wavs/1.wav
{"lang": "", "emotion": "", "event": "", "text": " родион потапыч высчитывал каждый новый вершок углубления и давно определил про себя", "timestamps": [0.00, 0.16, 0.28, 0.52, 0.68, 0.84, 0.96, 1.12, 1.44, 1.64, 1.76, 1.92, 2.08, 2.16, 2.36, 2.52, 2.60, 2.80, 2.96, 3.04, 3.20, 3.40, 3.44, 3.60, 3.68, 3.80, 3.88, 4.00, 4.16, 4.20, 4.68, 4.88, 5.08, 5.20, 5.44, 5.64, 5.68, 5.88, 6.32, 6.56], "tokens":[" ро", "ди", "он", " по", "та", "п", "ы", "ч", " вы", "с", "чи", "ты", "ва", "л", " ка", "жд", "ый", " но", "в", "ый", " вер", "ш", "о", "к", " у", "г", "лу", "б", "л", "ения", " и", " да", "в", "но", " оп", "ре", "дел", "ил", " про", " себя"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.280 s
Real time factor (RTF): 0.280 / 7.080 = 0.040
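
Besides the pre-built binaries shown above, you can decode the same file with the sherpa-onnx Python API (installed via pip install sherpa-onnx; numpy is also required). The following is a minimal sketch; the exact keyword arguments of OfflineRecognizer.from_transducer may differ slightly between sherpa-onnx versions.

import wave

import numpy as np
import sherpa_onnx

d = "./sherpa-onnx-zipformer-ru-2024-09-18"

# Use the non-int8 files here if you prefer the fp32 model.
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{d}/encoder.int8.onnx",
    decoder=f"{d}/decoder.onnx",
    joiner=f"{d}/joiner.int8.onnx",
    tokens=f"{d}/tokens.txt",
    num_threads=1,
    decoding_method="greedy_search",
)

with wave.open(f"{d}/test_wavs/1.wav", "rb") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768  # normalize to [-1, 1]

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)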

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx

Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --tokens=./sherpa-onnx-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-zipformer-ru-2024-09-18/joiner.int8.onnx

sherpa-onnx-small-zipformer-ru-2024-09-18 (Russian, 俄语)

This model is from https://huggingface.co/alphacep/vosk-model-small-ru/tree/main.

You can find the export script at https://github.com/k2-fsa/sherpa-onnx/blob/master/.github/workflows/export-russian-onnx-models.yaml

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2
tar xvf sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2
rm sherpa-onnx-small-zipformer-ru-2024-09-18.tar.bz2

You should see something like the following after downloading:

ls -lh  sherpa-onnx-small-zipformer-ru-2024-09-18/
total 257992
-rw-r--r--  1 fangjun  staff   240K Sep 18 12:02 bpe.model
-rw-r--r--  1 fangjun  staff   1.2M Sep 18 12:02 decoder.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Sep 18 12:02 decoder.onnx
-rw-r--r--  1 fangjun  staff    24M Sep 18 12:02 encoder.int8.onnx
-rw-r--r--  1 fangjun  staff    86M Sep 18 12:02 encoder.onnx
-rw-r--r--  1 fangjun  staff   253K Sep 18 12:02 joiner.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Sep 18 12:02 joiner.onnx
drwxr-xr-x  4 fangjun  staff   128B Sep 18 12:02 test_wavs
-rw-r--r--  1 fangjun  staff   6.2K Sep 18 12:02 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.onnx \
  --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.onnx \
  --num-threads=1 \
  ./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.onnx --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.onnx --num-threads=1 ./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.onnx", decoder_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx", joiner_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav
{"lang": "", "emotion": "", "event": "", "text": " родион потапыч высчитывал каждый новый вершок углубления и давно определил про себя", "timestamps": [0.00, 0.20, 0.28, 0.48, 0.68, 0.84, 0.92, 1.04, 1.48, 1.64, 1.76, 1.92, 2.08, 2.16, 2.40, 2.52, 2.60, 2.84, 3.00, 3.04, 3.20, 3.40, 3.48, 3.60, 3.68, 3.80, 3.88, 4.00, 4.12, 4.16, 4.72, 4.92, 5.12, 5.20, 5.48, 5.60, 5.68, 5.92, 6.28, 6.48], "tokens":[" ро", "ди", "он", " по", "та", "п", "ы", "ч", " вы", "с", "чи", "ты", "ва", "л", " ка", "жд", "ый", " но", "в", "ый", " вер", "ш", "о", "к", " у", "г", "лу", "б", "л", "ения", " и", " да", "в", "но", " оп", "ре", "дел", "ил", " про", " себя"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.228 s
Real time factor (RTF): 0.228 / 7.080 = 0.032

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx \
  --num-threads=1 \
  ./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx --num-threads=1 ./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx", decoder_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx", joiner_filename="./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-small-zipformer-ru-2024-09-18/test_wavs/1.wav
{"lang": "", "emotion": "", "event": "", "text": " родион потапыч высчитывал каждый новый вершок углубления и давно определил про себя", "timestamps": [0.00, 0.20, 0.28, 0.48, 0.68, 0.84, 0.92, 1.04, 1.48, 1.64, 1.76, 1.92, 2.08, 2.16, 2.40, 2.52, 2.60, 2.84, 3.00, 3.04, 3.20, 3.40, 3.48, 3.60, 3.68, 3.80, 3.88, 4.00, 4.12, 4.16, 4.72, 4.92, 5.12, 5.20, 5.48, 5.60, 5.68, 5.92, 6.28, 6.48], "tokens":[" ро", "ди", "он", " по", "та", "п", "ы", "ч", " вы", "с", "чи", "ты", "ва", "л", " ка", "жд", "ый", " но", "в", "ый", " вер", "ш", "о", "к", " у", "г", "лу", "б", "л", "ения", " и", " да", "в", "но", " оп", "ре", "дел", "ил", " про", " себя"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.183 s
Real time factor (RTF): 0.183 / 7.080 = 0.026
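
With the Python API you can also decode several files in one batch. The following minimal sketch uses the small model's int8 files listed above; it assumes your sherpa-onnx version provides OfflineRecognizer.decode_streams for batch decoding.

import glob
import wave

import numpy as np
import sherpa_onnx

d = "./sherpa-onnx-small-zipformer-ru-2024-09-18"

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{d}/encoder.int8.onnx",
    decoder=f"{d}/decoder.onnx",
    joiner=f"{d}/joiner.int8.onnx",
    tokens=f"{d}/tokens.txt",
    num_threads=1,
)

def read_wave(path):
    # Return (sample_rate, float32 samples in [-1, 1]) for a 16-bit mono wave file.
    with wave.open(path, "rb") as f:
        samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
        return f.getframerate(), samples.astype(np.float32) / 32768

paths = sorted(glob.glob(f"{d}/test_wavs/*.wav"))
streams = []
for path in paths:
    sample_rate, samples = read_wave(path)
    s = recognizer.create_stream()
    s.accept_waveform(sample_rate, samples)
    streams.append(s)

recognizer.decode_streams(streams)  # decode all streams in one call
for path, s in zip(paths, streams):
    print(path, "->", s.result.text)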

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx

Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --tokens=./sherpa-onnx-small-zipformer-ru-2024-09-18/tokens.txt \
  --encoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/encoder.int8.onnx \
  --decoder=./sherpa-onnx-small-zipformer-ru-2024-09-18/decoder.onnx \
  --joiner=./sherpa-onnx-small-zipformer-ru-2024-09-18/joiner.int8.onnx

sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01 (Japanese, 日语)

This model is from ReazonSpeech and supports only Japanese. It is trained on 35k hours of data.

The code for training the model can be found at https://github.com/k2-fsa/icefall/tree/master/egs/reazonspeech/ASR

The paper about the dataset is https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf

In the following, we describe how to download it and use it with sherpa-onnx.

Hint

The original onnx model is from

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2

tar xvf sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2
rm sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01.tar.bz2

ls -lh sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01

You should see the following output:

-rw-r--r--  1 fangjun  staff   1.2K Aug  1 18:32 README.md
-rw-r--r--  1 fangjun  staff   2.8M Aug  1 18:32 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    11M Aug  1 18:32 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   148M Aug  1 18:32 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   565M Aug  1 18:32 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   2.6M Aug  1 18:32 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    10M Aug  1 18:32 joiner-epoch-99-avg-1.onnx
drwxr-xr-x  8 fangjun  staff   256B Aug  1 18:31 test_wavs
-rw-r--r--  1 fangjun  staff    45K Aug  1 18:32 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.onnx \
  --num-threads=1 \
  ./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.onnx --num-threads=1 ./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav
{"text": "気象庁は雪や路面の凍結による交通への影響暴風雪や高波に警戒するとともに雪崩や屋根からの落雪にも十分注意するよう呼びかけています", "timestamps": [0.00, 0.48, 0.64, 0.88, 1.24, 1.44, 1.80, 2.00, 2.12, 2.40, 2.56, 2.80, 2.96, 3.04, 3.44, 3.60, 3.88, 4.00, 4.28, 4.40, 4.76, 4.96, 5.20, 5.40, 5.72, 5.92, 6.16, 6.48, 6.64, 6.88, 6.96, 7.08, 7.28, 7.48, 7.64, 8.00, 8.16, 8.36, 8.68, 8.80, 9.04, 9.12, 9.28, 9.64, 9.80, 10.00, 10.16, 10.44, 10.64, 10.92, 11.04, 11.24, 11.36, 11.52, 11.64, 11.88, 11.92, 12.16, 12.28, 12.44, 12.64, 13.16, 13.20], "tokens":["気", "象", "庁", "は", "雪", "や", "路", "面", "の", "凍", "結", "に", "よ", "る", "交", "通", "へ", "の", "影", "響", "暴", "風", "雪", "や", "高", "波", "に", "警", "戒", "す", "る", "と", "と", "も", "に", "雪", "崩", "や", "屋", "根", "か", "ら", "の", "落", "雪", "に", "も", "十", "分", "注", "意", "す", "る", "よ", "う", "呼", "び", "か", "け", "て", "い", "ま", "す"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 1.101 s
Real time factor (RTF): 1.101 / 13.433 = 0.082

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx \
  --num-threads=1 \
  ./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx --num-threads=1 ./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/test_wavs/1.wav
{"text": "気象庁は雪や路面の凍結による交通への影響暴風雪や高波に警戒するとともに雪崩や屋根からの落雪にも十分注意するよう呼びかけています", "timestamps": [0.00, 0.48, 0.64, 0.88, 1.24, 1.44, 1.80, 2.00, 2.12, 2.40, 2.56, 2.80, 2.96, 3.04, 3.44, 3.60, 3.88, 4.00, 4.28, 4.40, 4.76, 4.96, 5.20, 5.40, 5.72, 5.92, 6.20, 6.48, 6.64, 6.88, 6.96, 7.08, 7.28, 7.48, 7.64, 8.00, 8.16, 8.36, 8.68, 8.80, 9.04, 9.12, 9.28, 9.64, 9.80, 10.00, 10.16, 10.44, 10.64, 10.92, 11.04, 11.24, 11.36, 11.52, 11.60, 11.88, 11.92, 12.16, 12.28, 12.44, 12.64, 13.16, 13.20], "tokens":["気", "象", "庁", "は", "雪", "や", "路", "面", "の", "凍", "結", "に", "よ", "る", "交", "通", "へ", "の", "影", "響", "暴", "風", "雪", "や", "高", "波", "に", "警", "戒", "す", "る", "と", "と", "も", "に", "雪", "崩", "や", "屋", "根", "か", "ら", "の", "落", "雪", "に", "も", "十", "分", "注", "意", "す", "る", "よ", "う", "呼", "び", "か", "け", "て", "い", "ま", "す"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.719 s
Real time factor (RTF): 0.719 / 13.433 = 0.054
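
The real time factor (RTF) printed above is simply the decoding time divided by the audio duration. If you want to measure it yourself from the Python API, the following is a minimal sketch using the int8 files listed above (pip install sherpa-onnx and numpy are assumed; keyword arguments may vary slightly between versions).

import time
import wave

import numpy as np
import sherpa_onnx

d = "./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01"

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{d}/encoder-epoch-99-avg-1.int8.onnx",
    decoder=f"{d}/decoder-epoch-99-avg-1.onnx",
    joiner=f"{d}/joiner-epoch-99-avg-1.int8.onnx",
    tokens=f"{d}/tokens.txt",
    num_threads=1,
)

with wave.open(f"{d}/test_wavs/1.wav", "rb") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768

duration = len(samples) / sample_rate  # audio length in seconds

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)

start = time.time()
recognizer.decode_stream(stream)
elapsed = time.time() - start

print(stream.result.text)
print(f"RTF: {elapsed:.3f} / {duration:.3f} = {elapsed / duration:.3f}")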

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx

Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --tokens=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-ja-reazonspeech-2024-08-01/joiner-epoch-99-avg-1.int8.onnx

sherpa-onnx-zipformer-korean-2024-06-24 (Korean, 韩语)

PyTorch checkpoints of this model can be found at https://huggingface.co/johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24.

The training dataset can be found at https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=123.

The paper about the dataset is https://www.mdpi.com/2076-3417/10/19/6936

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2

tar xf sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2
rm sherpa-onnx-zipformer-korean-2024-06-24.tar.bz2

ls -lh sherpa-onnx-zipformer-korean-2024-06-24

You should see the following output:

-rw-r--r--  1 fangjun  staff   307K Jun 24 15:33 bpe.model
-rw-r--r--  1 fangjun  staff   2.7M Jun 24 15:33 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    11M Jun 24 15:33 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff    68M Jun 24 15:33 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   249M Jun 24 15:33 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   2.5M Jun 24 15:33 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   9.8M Jun 24 15:33 joiner-epoch-99-avg-1.onnx
drwxr-xr-x  7 fangjun  staff   224B Jun 24 15:32 test_wavs
-rw-r--r--  1 fangjun  staff    59K Jun 24 15:33 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:360 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt --encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
{"text": " 그는 괜찮은 척하려고 애쓰는 것 같았다.", "timestamps": [0.12, 0.24, 0.56, 1.00, 1.20, 1.32, 2.00, 2.16, 2.32, 2.52, 2.68, 2.80, 3.08, 3.28], "tokens":[" 그", "는", " 괜찮은", " 척", "하", "려고", " 애", "쓰", "는", " 것", " 같", "았", "다", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.119 s
Real time factor (RTF): 0.119 / 3.526 = 0.034

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:360 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt --encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-korean-2024-06-24/test_wavs/0.wav
{"text": " 그는 괜찮은 척하려고 애쓰는 것 같았다.", "timestamps": [0.12, 0.24, 0.56, 1.00, 1.20, 1.32, 2.00, 2.16, 2.32, 2.52, 2.68, 2.84, 3.08, 3.28], "tokens":[" 그", "는", " 괜찮은", " 척", "하", "려고", " 애", "쓰", "는", " 것", " 같", "았", "다", "."], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.092 s
Real time factor (RTF): 0.092 / 3.526 = 0.026
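
The JSON output above also contains per-token timestamps. The same information is available from the Python API through the recognition result; the following is a minimal sketch, assuming your sherpa-onnx version exposes the timestamps and tokens attributes on the result.

import wave

import numpy as np
import sherpa_onnx

d = "./sherpa-onnx-zipformer-korean-2024-06-24"

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{d}/encoder-epoch-99-avg-1.int8.onnx",
    decoder=f"{d}/decoder-epoch-99-avg-1.onnx",
    joiner=f"{d}/joiner-epoch-99-avg-1.int8.onnx",
    tokens=f"{d}/tokens.txt",
    num_threads=2,
)

with wave.open(f"{d}/test_wavs/0.wav", "rb") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)

result = stream.result
print(result.text)
# Per-token timestamps (in seconds), matching the JSON output shown above.
for t, tok in zip(result.timestamps, result.tokens):
    print(f"{t:.2f}\t{tok}")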

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-korean-2024-06-24/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-korean-2024-06-24/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-korean-2024-06-24/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-korean-2024-06-24/joiner-epoch-99-avg-1.int8.onnx

sherpa-onnx-zipformer-thai-2024-06-20 (Thai, 泰语)

PyTorch checkpoints of this model can be found at https://huggingface.co/yfyeung/icefall-asr-gigaspeech2-th-zipformer-2024-06-20.

The training dataset can be found at https://github.com/SpeechColab/GigaSpeech2.

The paper about the dataset is https://arxiv.org/pdf/2406.11546.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2

tar xf sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2
rm sherpa-onnx-zipformer-thai-2024-06-20.tar.bz2

ls -lh sherpa-onnx-zipformer-thai-2024-06-20

You should see the following output:

-rw-r--r--  1 fangjun  staff   277K Jun 20 16:47 bpe.model
-rw-r--r--  1 fangjun  staff   1.2M Jun 20 16:47 decoder-epoch-12-avg-5.int8.onnx
-rw-r--r--  1 fangjun  staff   4.9M Jun 20 16:47 decoder-epoch-12-avg-5.onnx
-rw-r--r--  1 fangjun  staff   148M Jun 20 16:47 encoder-epoch-12-avg-5.int8.onnx
-rw-r--r--  1 fangjun  staff   565M Jun 20 16:47 encoder-epoch-12-avg-5.onnx
-rw-r--r--  1 fangjun  staff   1.0M Jun 20 16:47 joiner-epoch-12-avg-5.int8.onnx
-rw-r--r--  1 fangjun  staff   3.9M Jun 20 16:47 joiner-epoch-12-avg-5.onnx
drwxr-xr-x  6 fangjun  staff   192B Jun 20 16:46 test_wavs
-rw-r--r--  1 fangjun  staff    38K Jun 20 16:47 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.onnx \
  --decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
  --joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.onnx \
  ./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:360 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt --encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.onnx --decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx --joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.onnx ./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.onnx", decoder_filename="./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx", joiner_filename="./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav
{"text": " แต่เดี๋ยวเกมในนัดต่อไปต้องไปเจอกับทางอินโดนีเซียอะไรอย่างนี้", "timestamps": [0.00, 0.08, 0.24, 0.44, 0.64, 0.84, 1.20, 1.84, 2.32, 2.64, 3.12, 3.64, 3.80, 3.88, 4.28], "tokens":[" แต่", "เดี๋ยว", "เกม", "ใน", "นัด", "ต่อไป", "ต้อง", "ไปเจอ", "กับ", "ทาง", "อิน", "โดน", "ี", "เซีย", "อะไรอย่างนี้"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.181 s
Real time factor (RTF): 0.181 / 4.496 = 0.040

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
  --joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx \
  ./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:360 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt --encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx --decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx --joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx ./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx", joiner_filename="./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-thai-2024-06-20/test_wavs/0.wav
{"text": " เดี๋ยวเกมในนัดต่อไปต้องไปเจอกับทางอินโดนีเซียนะครับ", "timestamps": [0.00, 0.24, 0.44, 0.64, 0.84, 1.20, 1.84, 2.32, 2.64, 3.12, 3.64, 3.80, 3.88, 4.28], "tokens":[" เดี๋ยว", "เกม", "ใน", "นัด", "ต่อไป", "ต้อง", "ไปเจอ", "กับ", "ทาง", "อิน", "โดน", "ี", "เซีย", "นะครับ"], "words": []}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.150 s
Real time factor (RTF): 0.150 / 4.496 = 0.033

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-thai-2024-06-20/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-thai-2024-06-20/encoder-epoch-12-avg-5.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-thai-2024-06-20/decoder-epoch-12-avg-5.onnx \
  --joiner=./sherpa-onnx-zipformer-thai-2024-06-20/joiner-epoch-12-avg-5.int8.onnx

sherpa-onnx-zipformer-cantonese-2024-03-13 (Cantonese, 粤语)

Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1537. It supports only Cantonese since it is trained on a Cantonese dataset. The paper for the dataset can be found at https://arxiv.org/pdf/2201.02419.pdf.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2

tar xf sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2
rm sherpa-onnx-zipformer-cantonese-2024-03-13.tar.bz2

ls -lh sherpa-onnx-zipformer-cantonese-2024-03-13

You should see the following output:

total 340M
-rw-r--r-- 1 1001 127 2.7M Mar 13 09:06 decoder-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127  11M Mar 13 09:06 decoder-epoch-45-avg-35.onnx
-rw-r--r-- 1 1001 127  67M Mar 13 09:06 encoder-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127 248M Mar 13 09:06 encoder-epoch-45-avg-35.onnx
-rw-r--r-- 1 1001 127 2.4M Mar 13 09:06 joiner-epoch-45-avg-35.int8.onnx
-rw-r--r-- 1 1001 127 9.5M Mar 13 09:06 joiner-epoch-45-avg-35.onnx
drwxr-xr-x 2 1001 127 4.0K Mar 13 09:06 test_wavs
-rw-r--r-- 1 1001 127  42K Mar 13 09:06 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --blank-penalty=1.2 \
  --tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.onnx \
  --decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
  --joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.onnx \
  ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav \
  ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --blank-penalty=1.2 --tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt --encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.onnx --decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx --joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.onnx ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.onnx", decoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx", joiner_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=1.2)
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav
{"text": "啊有冇人知道灣仔活道係點去㗎", "timestamps": [0.00, 0.88, 1.28, 1.52, 1.84, 2.08, 2.32, 2.56, 2.80, 3.04, 3.20, 3.44, 3.68, 3.92], "tokens":["啊", "有", "冇", "人", "知", "道", "灣", "仔", "活", "道", "係", "點", "去", "㗎"]}
----
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
{"text": "我喺黃大仙九龍塘聯合到當失路啊", "timestamps": [0.00, 0.64, 0.88, 1.12, 1.28, 1.60, 1.80, 2.16, 2.36, 2.56, 2.88, 3.08, 3.32, 3.44, 3.60], "tokens":["我", "喺", "黃", "大", "仙", "九", "龍", "塘", "聯", "合", "到", "當", "失", "路", "啊"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.349 s
Real time factor (RTF): 1.349 / 10.320 = 0.131

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --blank-penalty=1.2 \
  --tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
  --joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx \
  ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav \
  ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your command line.

You should see the following output:

/project/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --blank-penalty=1.2 --tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt --encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx --decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx --joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav ./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx", joiner_filename="./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=1.2)
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_1.wav
{"text": "啊有冇人知道灣仔活道係點去㗎", "timestamps": [0.00, 0.88, 1.28, 1.52, 1.84, 2.08, 2.32, 2.56, 2.80, 3.04, 3.20, 3.44, 3.68, 3.92], "tokens":["啊", "有", "冇", "人", "知", "道", "灣", "仔", "活", "道", "係", "點", "去", "㗎"]}
----
./sherpa-onnx-zipformer-cantonese-2024-03-13/test_wavs/test_wavs_2.wav
{"text": "我喺黃大仙九龍塘聯合到當失路啊", "timestamps": [0.00, 0.64, 0.88, 1.12, 1.28, 1.60, 1.80, 2.16, 2.36, 2.56, 2.88, 3.08, 3.32, 3.44, 3.60], "tokens":["我", "喺", "黃", "大", "仙", "九", "龍", "塘", "聯", "合", "到", "當", "失", "路", "啊"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.907 s
Real time factor (RTF): 0.907 / 10.320 = 0.088

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-cantonese-2024-03-13/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/encoder-epoch-45-avg-35.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-cantonese-2024-03-13/decoder-epoch-45-avg-35.onnx \
  --joiner=./sherpa-onnx-zipformer-cantonese-2024-03-13/joiner-epoch-45-avg-35.int8.onnx

sherpa-onnx-zipformer-gigaspeech-2023-12-12 (English)

Training code for this model can be found at https://github.com/k2-fsa/icefall/pull/1254. It supports only English since it is trained on the GigaSpeech dataset.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2

tar xf sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2
rm sherpa-onnx-zipformer-gigaspeech-2023-12-12.tar.bz2
ls -lh sherpa-onnx-zipformer-gigaspeech-2023-12-12

You should see the following output:

total 656184
-rw-r--r--  1 fangjun  staff    28B Dec 12 19:00 README.md
-rw-r--r--  1 fangjun  staff   239K Dec 12 19:00 bpe.model
-rw-r--r--  1 fangjun  staff   528K Dec 12 19:00 decoder-epoch-30-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Dec 12 19:00 decoder-epoch-30-avg-1.onnx
-rw-r--r--  1 fangjun  staff    68M Dec 12 19:00 encoder-epoch-30-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   249M Dec 12 19:00 encoder-epoch-30-avg-1.onnx
-rw-r--r--  1 fangjun  staff   253K Dec 12 19:00 joiner-epoch-30-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Dec 12 19:00 joiner-epoch-30-avg-1.onnx
drwxr-xr-x  5 fangjun  staff   160B Dec 12 19:00 test_wavs
-rw-r--r--  1 fangjun  staff   4.9K Dec 12 19:00 tokens.txt

Decode wave files

Hint

It supports decoding only single-channel wave files with 16-bit encoded samples; the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5)
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.36, 0.52, 0.68, 0.96, 1.00, 1.08, 1.28, 1.40, 1.48, 1.60, 1.76, 1.80, 1.88, 1.92, 2.00, 2.20, 2.32, 2.36, 2.48, 2.60, 2.80, 2.84, 2.92, 3.12, 3.32, 3.56, 3.76, 4.04, 4.20, 4.32, 4.40, 4.56, 4.80, 4.92, 5.08, 5.36, 5.48, 5.64, 5.72, 5.88, 6.04, 6.24], "tokens":[" AFTER", " E", "AR", "LY", " ", "N", "IGHT", "F", "AL", "L", " THE", " ", "Y", "E", "LL", "OW", " LA", "M", "P", "S", " WOULD", " ", "L", "IGHT", " UP", " HERE", " AND", " THERE", " THE", " S", "QU", "AL", "ID", " QU", "AR", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "timestamps": [0.00, 0.16, 0.40, 0.68, 0.84, 0.96, 1.04, 1.12, 1.32, 1.52, 1.68, 1.76, 2.00, 2.12, 2.28, 2.40, 2.64, 2.92, 3.20, 3.32, 3.52, 3.64, 3.76, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.40, 5.64, 5.76, 5.88, 6.12, 6.28, 6.48, 6.84, 7.08, 7.32, 7.60, 7.92, 8.12, 8.24, 8.36, 8.48, 8.64, 8.76, 8.88, 9.12, 9.32, 9.48, 9.56, 9.60, 9.76, 10.00, 10.12, 10.20, 10.44, 10.68, 10.80, 11.00, 11.20, 11.36, 11.52, 11.76, 12.00, 12.12, 12.24, 12.28, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.64, 13.76, 14.00, 14.08, 14.24, 14.52, 14.68, 14.80, 15.00, 15.04, 15.28, 15.52, 15.76, 16.00, 16.12, 16.20, 16.32], "tokens":[" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE", "QU", "ENCE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN", "ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHI", "L", "D", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OR", "ED", " BO", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " PA", "R", "ENT", " FOR", " E", "VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R", "T", "AL", "S", " AND", " TO", " BE", " F", "IN", "ALLY", " A", " B", "LES", "S", "ED", " SO", "UL", " IN", " HE", "A", "VE", "N"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRYNE LESS WITH HOPE THAN APPREHENSION", "timestamps": [0.00, 0.04, 0.12, 0.40, 0.68, 0.88, 0.96, 1.12, 1.20, 1.32, 1.44, 1.48, 1.64, 1.76, 1.88, 2.04, 2.16, 2.28, 2.52, 2.68, 2.72, 2.88, 3.12, 3.28, 3.52, 3.80, 4.00, 4.16, 4.24, 4.40, 4.48], "tokens":[" ", "Y", "ET", " THESE", " THOUGH", "T", "S", " A", "FF", "E", "C", "TED", " HE", "S", "TER", " P", "RY", "NE", " LE", "S", "S", " WITH", " HO", "PE", " THAN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.407 s
Real time factor (RTF): 1.407 / 28.165 = 0.050
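
Hint

The real time factor (RTF) above is the elapsed decoding time divided by the total duration of the test files (28.165 seconds here), so values below 1 mean decoding is faster than real time. The command uses 2 threads by default (see num_threads=2 in the config dump); you can pass, for example, --num-threads=4 to the command above to change this.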

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav \
  ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav ./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5)
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1089-134686-0001.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.00, 0.36, 0.52, 0.68, 0.96, 1.00, 1.08, 1.28, 1.40, 1.48, 1.60, 1.76, 1.80, 1.88, 1.92, 2.00, 2.20, 2.32, 2.36, 2.48, 2.60, 2.80, 2.84, 2.92, 3.12, 3.32, 3.56, 3.76, 4.04, 4.24, 4.32, 4.40, 4.56, 4.80, 4.92, 5.08, 5.36, 5.48, 5.64, 5.72, 5.88, 6.04, 6.24], "tokens":[" AFTER", " E", "AR", "LY", " ", "N", "IGHT", "F", "AL", "L", " THE", " ", "Y", "E", "LL", "OW", " LA", "M", "P", "S", " WOULD", " ", "L", "IGHT", " UP", " HERE", " AND", " THERE", " THE", " S", "QU", "AL", "ID", " QU", "AR", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0001.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "timestamps": [0.00, 0.16, 0.40, 0.68, 0.84, 0.96, 1.08, 1.12, 1.32, 1.52, 1.68, 1.76, 2.00, 2.12, 2.28, 2.40, 2.64, 2.92, 3.20, 3.32, 3.52, 3.64, 3.76, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.40, 5.64, 5.76, 5.88, 6.12, 6.28, 6.52, 6.84, 7.08, 7.32, 7.60, 7.92, 8.12, 8.24, 8.36, 8.48, 8.64, 8.76, 8.88, 9.12, 9.32, 9.48, 9.56, 9.60, 9.76, 10.00, 10.12, 10.20, 10.44, 10.68, 10.80, 11.00, 11.20, 11.36, 11.52, 11.76, 12.00, 12.12, 12.24, 12.28, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.44, 13.64, 13.76, 14.00, 14.08, 14.24, 14.52, 14.68, 14.80, 15.00, 15.04, 15.28, 15.48, 15.76, 16.00, 16.12, 16.16, 16.32], "tokens":[" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE", "QU", "ENCE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN", "ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHI", "L", "D", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OR", "ED", " BO", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " PA", "R", "ENT", " FOR", " E", "VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R", "T", "AL", "S", " AND", " TO", " BE", " F", "IN", "ALLY", " A", " B", "LES", "S", "ED", " SO", "UL", " IN", " HE", "A", "VE", "N"]}
----
./sherpa-onnx-zipformer-gigaspeech-2023-12-12/test_wavs/1221-135766-0002.wav
{"text": " YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION", "timestamps": [0.00, 0.04, 0.12, 0.40, 0.68, 0.88, 0.96, 1.12, 1.24, 1.32, 1.44, 1.48, 1.64, 1.76, 1.88, 2.04, 2.16, 2.28, 2.32, 2.52, 2.68, 2.72, 2.88, 3.12, 3.32, 3.52, 3.80, 4.00, 4.16, 4.24, 4.40, 4.48], "tokens":[" ", "Y", "ET", " THESE", " THOUGH", "T", "S", " A", "FF", "E", "C", "TED", " HE", "S", "TER", " P", "RY", "N", "NE", " LE", "S", "S", " WITH", " HO", "PE", " THAN", " APP", "RE", "HE", "N", "S", "ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.101 s
Real time factor (RTF): 1.101 / 28.165 = 0.039
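
Hint

Compared with the fp32 run above, the int8 models produce essentially the same transcripts while decoding faster (1.101 s vs. 1.407 s, RTF 0.039 vs. 0.050) and using much smaller encoder and joiner files (see the file listing at the top of this section).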

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx

Speech recognition from a microphone with VAD

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --tokens=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/encoder-epoch-30-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/decoder-epoch-30-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-gigaspeech-2023-12-12/joiner-epoch-30-avg-1.int8.onnx
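
Hint

silero_vad.onnx is a voice activity detection (VAD) model. The command above uses it to segment the microphone stream into utterances, and each detected utterance is then decoded by the offline (non-streaming) recognizer.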

zrjin/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2 (Chinese)

This model is from

https://huggingface.co/zrjin/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2

which supports Chinese, as it is trained on the datasets used in the multi-zh_hans recipe.

If you are interested in how the model is trained, please refer to https://github.com/k2-fsa/icefall/pull/1238.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2

tar xvf sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2
rm sherpa-onnx-zipformer-multi-zh-hans-2023-9-2.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-multi-zh-hans-2023-9-2 zengruijin$ ls -lh *.onnx
-rw-rw-r--@ 1 zengruijin  staff   1.2M Sep 18 07:04 decoder-epoch-20-avg-1.int8.onnx
-rw-rw-r--@ 1 zengruijin  staff   4.9M Sep 18 07:04 decoder-epoch-20-avg-1.onnx
-rw-rw-r--@ 1 zengruijin  staff    66M Sep 18 07:04 encoder-epoch-20-avg-1.int8.onnx
-rw-rw-r--@ 1 zengruijin  staff   248M Sep 18 07:05 encoder-epoch-20-avg-1.onnx
-rw-rw-r--@ 1 zengruijin  staff   1.0M Sep 18 07:05 joiner-epoch-20-avg-1.int8.onnx
-rw-rw-r--@ 1 zengruijin  staff   3.9M Sep 18 07:05 joiner-epoch-20-avg-1.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt --encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx --decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx --joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5)
Creating recognizer ...
Started
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:117 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav
{"text":" 对我做了介绍那么我想说的是大家如果对我的研究感兴趣","timestamps":"[0.00, 0.16, 0.40, 0.60, 0.84, 1.08, 1.60, 1.72, 1.88, 2.04, 2.24, 2.44, 2.60, 2.96, 3.12, 3.32, 3.40, 3.60, 3.72, 3.84, 4.00, 4.16, 4.32, 4.52, 4.68]","tokens":[" 对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav
{"text":" 重点想谈三个问题首先就是这一轮全球金融动<0xE8><0x8D><0xA1>的表现","timestamps":"[0.00, 0.12, 0.48, 0.68, 0.92, 1.12, 1.28, 1.48, 1.80, 2.04, 2.40, 2.56, 2.76, 2.96, 3.08, 3.32, 3.48, 3.68, 3.84, 4.00, 4.20, 4.24, 4.28, 4.40, 4.60, 4.84]","tokens":[" 重","点","想","谈","三","个","问","题","首","先","就","是","这","一","轮","全","球","金","融","动","<0xE8>","<0x8D>","<0xA1>","的","表","现"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
{"text":" 深入地分析这一次全球金融动<0xE8><0x8D><0xA1>背后的根源","timestamps":"[0.00, 0.04, 0.24, 0.52, 0.76, 1.00, 1.40, 1.64, 1.80, 2.12, 2.32, 2.64, 2.80, 3.00, 3.20, 3.24, 3.28, 3.44, 3.64, 3.76, 3.96, 4.20]","tokens":[" ","深","入","地","分","析","这","一","次","全","球","金","融","动","<0xE8>","<0x8D>","<0xA1>","背","后","的","根","源"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.362 s
Real time factor (RTF): 0.362 / 15.289 = 0.024
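
Hint

Two things are worth noting in the output above. First, 8k.wav is sampled at 8000 Hz, so a resampler is created automatically to convert it to 16000 Hz before feature extraction. Second, tokens such as <0xE8>, <0x8D>, <0xA1> are byte-fallback pieces from the BPE model, presumably emitted because the character is not in the BPE vocabulary as a whole piece; together, the three bytes are the UTF-8 encoding of 荡. You can verify this with a one-liner (assuming python3 is available):

python3 -c 'print(bytes([0xE8, 0x8D, 0xA1]).decode("utf-8"))'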

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt --encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx --joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.int8.onnx ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav ./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5)
Creating recognizer ...
Started
/Users/runner/work/sherpa-onnx/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:117 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/0.wav
{"text":" 对我做了介绍那么我想说的是大家如果对我的研究感兴趣","timestamps":"[0.00, 0.16, 0.40, 0.60, 0.84, 1.08, 1.60, 1.72, 1.88, 2.04, 2.28, 2.44, 2.60, 2.96, 3.12, 3.32, 3.40, 3.60, 3.76, 3.84, 4.00, 4.16, 4.32, 4.52, 4.56]","tokens":[" 对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/1.wav
{"text":" 重点想谈三个问题首先就是这一轮全球金融动<0xE8><0x8D><0xA1>的表现","timestamps":"[0.00, 0.12, 0.48, 0.68, 0.92, 1.12, 1.28, 1.48, 1.80, 2.04, 2.40, 2.56, 2.76, 2.96, 3.08, 3.32, 3.48, 3.68, 3.84, 4.00, 4.20, 4.24, 4.28, 4.40, 4.60, 4.84]","tokens":[" 重","点","想","谈","三","个","问","题","首","先","就","是","这","一","轮","全","球","金","融","动","<0xE8>","<0x8D>","<0xA1>","的","表","现"]}
----
./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/test_wavs/8k.wav
{"text":" 深入地分析这一次全球金融动<0xE8><0x8D><0xA1>背后的根源","timestamps":"[0.00, 0.04, 0.24, 0.52, 0.76, 1.00, 1.40, 1.64, 1.80, 2.12, 2.36, 2.64, 2.80, 3.04, 3.16, 3.20, 3.24, 3.44, 3.64, 3.76, 3.96, 4.20]","tokens":[" ","深","入","地","分","析","这","一","次","全","球","金","融","动","<0xE8>","<0x8D>","<0xA1>","背","后","的","根","源"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.305 s
Real time factor (RTF): 0.305 / 15.289 = 0.020

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/encoder-epoch-20-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/decoder-epoch-20-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-multi-zh-hans-2023-9-2/joiner-epoch-20-avg-1.onnx

yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17 (English)

This model is from

https://huggingface.co/yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17

which supports only English as it is trained on the CommonVoice English dataset.

If you are interested in how the model is trained, please refer to https://github.com/k2-fsa/icefall/pull/997.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2

tar xvf icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2
rm icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17 fangjun$ ls -lh exp/*epoch-60-avg-20*.onnx
-rw-r--r--  1 fangjun  staff   1.2M Jun 27 09:53 exp/decoder-epoch-60-avg-20.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Jun 27 09:54 exp/decoder-epoch-60-avg-20.onnx
-rw-r--r--  1 fangjun  staff   121M Jun 27 09:54 exp/encoder-epoch-60-avg-20.int8.onnx
-rw-r--r--  1 fangjun  staff   279M Jun 27 09:55 exp/encoder-epoch-60-avg-20.onnx
-rw-r--r--  1 fangjun  staff   253K Jun 27 09:53 exp/joiner-epoch-60-avg-20.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Jun 27 09:53 exp/joiner-epoch-60-avg-20.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx \
  --decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
  --joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt --encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx --decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx --joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx", decoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx", joiner_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
Done!

./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.04, 1.08, 1.16, 1.32, 1.44, 1.56, 1.72, 1.84, 1.88, 1.92, 1.96, 2.04, 2.16, 2.32, 2.48, 2.56, 2.76, 2.80, 2.84, 3.08, 3.28, 3.40, 3.52, 3.68, 4.00, 4.24, 4.28, 4.52, 4.68, 4.84, 4.88, 4.96, 5.04, 5.28, 5.40, 5.52, 5.72, 5.88, 6.08]","tokens":[" AFTER"," E","AR","LY"," ","N","IGHT","F","AL","L"," THE"," ","Y","E","LL","OW"," LA","MP","S"," WOULD"," ","L","IGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," BRO","TH","EL","S"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.04, 0.44, 0.64, 0.84, 0.96, 1.32, 1.52, 1.68, 1.84, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.44, 3.52, 3.72, 3.88, 4.20, 4.40, 4.48, 4.60, 4.76, 4.96, 5.08, 5.24, 5.36, 5.56, 5.80, 6.20, 6.32, 6.52, 6.92, 7.16, 7.36, 7.60, 7.76, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.84, 9.08, 9.24, 9.44, 9.48, 9.72, 9.88, 10.04, 10.12, 10.52, 10.76, 10.84, 11.08, 11.24, 11.36, 11.60, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.72, 12.84, 12.92, 13.00, 13.20, 13.52, 13.76, 13.88, 14.08, 14.28, 14.52, 14.64, 14.76, 14.96, 15.04, 15.24, 15.48, 15.68, 15.84, 16.00, 16.04]","tokens":[" GO","D"," AS"," A"," DIRECT"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," G","IVE","N"," HER"," A"," LO","VE","LY"," CHI","LD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SA","ME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","N","ECT"," HER"," PA","R","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FIN","ALLY"," A"," B","LES","S","ED"," SO","UL"," IN"," HE","A","VEN"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRIN LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.04, 0.12, 0.56, 0.80, 0.88, 1.00, 1.04, 1.12, 1.20, 1.28, 1.40, 1.52, 1.64, 1.76, 1.84, 2.04, 2.24, 2.40, 2.64, 2.68, 2.84, 3.04, 3.24, 3.44, 3.52, 3.72, 3.92, 4.00, 4.16, 4.24, 4.36]","tokens":[" ","Y","ET"," THESE"," TH","O","UGH","T","S"," A","FF","ECT","ED"," HE","S","TER"," PRI","N"," LE","S","S"," WITH"," HO","PE"," TH","AN"," APP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.611 s
Real time factor (RTF): 1.611 / 28.165 = 0.057

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.int8.onnx \
  --decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
  --joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.int8.onnx \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav \
  ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt --encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.int8.onnx --decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx --joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.int8.onnx ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav ./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.int8.onnx", decoder_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx", joiner_filename="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
Done!

./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1089-134686-0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.04, 1.08, 1.16, 1.36, 1.44, 1.56, 1.72, 1.84, 1.88, 1.92, 1.96, 2.04, 2.20, 2.32, 2.48, 2.56, 2.76, 2.80, 2.84, 3.08, 3.28, 3.40, 3.52, 3.68, 4.00, 4.24, 4.28, 4.52, 4.68, 4.84, 4.88, 4.96, 5.04, 5.28, 5.36, 5.52, 5.72, 5.88, 6.08]","tokens":[" AFTER"," E","AR","LY"," ","N","IGHT","F","AL","L"," THE"," ","Y","E","LL","OW"," LA","MP","S"," WOULD"," ","L","IGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," BRO","TH","EL","S"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.04, 0.44, 0.64, 0.84, 0.96, 1.32, 1.52, 1.68, 1.84, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.44, 3.52, 3.72, 3.88, 4.20, 4.40, 4.48, 4.60, 4.76, 4.96, 5.08, 5.24, 5.36, 5.56, 5.80, 6.20, 6.32, 6.52, 6.92, 7.16, 7.32, 7.60, 7.76, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.84, 9.08, 9.24, 9.44, 9.48, 9.72, 9.88, 10.04, 10.12, 10.52, 10.76, 10.84, 11.08, 11.24, 11.36, 11.60, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.72, 12.84, 12.92, 13.00, 13.20, 13.52, 13.76, 13.88, 14.08, 14.28, 14.52, 14.64, 14.76, 14.96, 15.04, 15.24, 15.48, 15.68, 15.84, 16.00, 16.04]","tokens":[" GO","D"," AS"," A"," DIRECT"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," G","IVE","N"," HER"," A"," LO","VE","LY"," CHI","LD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SA","ME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","N","ECT"," HER"," PA","R","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FIN","ALLY"," A"," B","LES","S","ED"," SO","UL"," IN"," HE","A","VEN"]}
----
./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/test_wavs/1221-135766-0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRIN LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.04, 0.12, 0.56, 0.80, 0.88, 1.00, 1.04, 1.12, 1.20, 1.28, 1.40, 1.52, 1.64, 1.76, 1.84, 2.04, 2.24, 2.40, 2.64, 2.68, 2.84, 3.04, 3.24, 3.44, 3.52, 3.72, 3.92, 4.00, 4.16, 4.24, 4.36]","tokens":[" ","Y","ET"," THESE"," TH","O","UGH","T","S"," A","FF","ECT","ED"," HE","S","TER"," PRI","N"," LE","S","S"," WITH"," HO","PE"," TH","AN"," APP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.368 s
Real time factor (RTF): 1.368 / 28.165 = 0.049

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/encoder-epoch-60-avg-20.onnx \
  --decoder=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/decoder-epoch-60-avg-20.onnx \
  --joiner=./icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17/exp/joiner-epoch-60-avg-20.onnx

k2-fsa/icefall-asr-zipformer-wenetspeech-small (Chinese)

This model is from

https://huggingface.co/k2-fsa/icefall-asr-zipformer-wenetspeech-small

which supports only Chinese as it is trained on the WenetSpeech corpus.

In the following, we describe how to download it.

Download the model

Please use the following commands to download it.

git lfs install
git clone https://huggingface.co/k2-fsa/icefall-asr-zipformer-wenetspeech-small
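
After cloning, it is a good idea to verify that the *.onnx files were actually downloaded by git-lfs rather than left as small pointer files. A minimal check (the directory layout inside the repository is not shown here, so the command simply searches for all *.onnx files):

cd icefall-asr-zipformer-wenetspeech-small
find . -name "*.onnx" | xargs ls -lh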

k2-fsa/icefall-asr-zipformer-wenetspeech-large (Chinese)

This model is from

https://huggingface.co/k2-fsa/icefall-asr-zipformer-wenetspeech-large

which supports only Chinese as it is trained on the WenetSpeech corpus.

In the following, we describe how to download it.

Download the model

Please use the following commands to download it.

git lfs install
git clone https://huggingface.co/k2-fsa/icefall-asr-zipformer-wenetspeech-large

pkufool/icefall-asr-zipformer-wenetspeech-20230615 (Chinese)

This model is from

https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615

which supports only Chinese as it is trained on the WenetSpeech corpus.

If you are interested in how the model is trained, please refer to https://github.com/k2-fsa/icefall/pull/1130.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-zipformer-wenetspeech-20230615.tar.bz2

tar xvf icefall-asr-zipformer-wenetspeech-20230615.tar.bz2
rm icefall-asr-zipformer-wenetspeech-20230615.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

icefall-asr-zipformer-wenetspeech-20230615 fangjun$ ls -lh exp/*.onnx
-rw-r--r--  1 fangjun  staff    11M Jun 26 14:31 exp/decoder-epoch-12-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff    12M Jun 26 14:31 exp/decoder-epoch-12-avg-4.onnx
-rw-r--r--  1 fangjun  staff    66M Jun 26 14:32 exp/encoder-epoch-12-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff   248M Jun 26 14:34 exp/encoder-epoch-12-avg-4.onnx
-rw-r--r--  1 fangjun  staff   2.7M Jun 26 14:31 exp/joiner-epoch-12-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff    11M Jun 26 14:31 exp/joiner-epoch-12-avg-4.onnx
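
Hint

The decoder and joiner here are noticeably larger than those of the BPE-based English models (11M vs. roughly 2M for the fp32 decoder). This is presumably because the model uses lang_char, i.e. a Chinese character vocabulary that is much larger than a 500-piece BPE vocabulary, and the decoder embedding and joiner output layers grow with the vocabulary size.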

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
  --encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx \
  --decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
  --joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your commandline.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt --encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx --decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx --joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx", decoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx", joiner_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
Done!

./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
{"text":"对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢","timestamps":"[0.00, 0.12, 0.48, 0.64, 0.88, 1.16, 1.64, 1.76, 1.92, 2.08, 2.32, 2.48, 2.64, 3.08, 3.20, 3.40, 3.48, 3.64, 3.76, 3.88, 3.96, 4.12, 4.28, 4.52, 4.72, 4.84]","tokens":["对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣","呢"]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav
{"text":"重点想谈三个问题首先就是这一轮全球金融动荡的表现","timestamps":"[0.00, 0.16, 0.48, 0.72, 0.92, 1.08, 1.28, 1.52, 1.92, 2.08, 2.52, 2.64, 2.88, 3.04, 3.20, 3.40, 3.56, 3.76, 3.84, 4.00, 4.16, 4.32, 4.56, 4.84]","tokens":["重","点","想","谈","三","个","问","题","首","先","就","是","这","一","轮","全","球","金","融","动","荡","的","表","现"]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav
{"text":"深入地分析这一次全球金融动荡背后的根源","timestamps":"[0.00, 0.32, 0.56, 0.84, 1.12, 1.44, 1.68, 1.84, 2.28, 2.48, 2.76, 2.92, 3.12, 3.28, 3.44, 3.60, 3.72, 3.92, 4.20]","tokens":["深","入","地","分","析","这","一","次","全","球","金","融","动","荡","背","后","的","根","源"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.458 s
Real time factor (RTF): 0.458 / 15.289 = 0.030

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
  --encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.int8.onnx \
  --decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
  --joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.int8.onnx \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav \
  ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

Caution

If you use Windows and get encoding issues, please run:

CHCP 65001

in your commandline.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt --encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.int8.onnx --decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx --joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.int8.onnx ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav ./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.int8.onnx", decoder_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx", joiner_filename="./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
Done!

./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000000.wav
{"text":"对我做了介绍那么我想说的是大家如果对我的研究感兴趣呢","timestamps":"[0.00, 0.12, 0.48, 0.60, 0.80, 1.08, 1.64, 1.76, 1.92, 2.08, 2.32, 2.48, 2.64, 3.08, 3.20, 3.28, 3.44, 3.60, 3.72, 3.84, 3.92, 4.12, 4.28, 4.48, 4.72, 4.84]","tokens":["对","我","做","了","介","绍","那","么","我","想","说","的","是","大","家","如","果","对","我","的","研","究","感","兴","趣","呢"]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000001.wav
{"text":"重点想谈三个问题首先呢就是这一轮全球金融动荡的表现","timestamps":"[0.00, 0.16, 0.48, 0.68, 0.84, 1.08, 1.20, 1.48, 1.64, 2.08, 2.36, 2.52, 2.64, 2.84, 3.00, 3.16, 3.40, 3.52, 3.72, 3.84, 4.00, 4.16, 4.32, 4.56, 4.84]","tokens":["重","点","想","谈","三","个","问","题","首","先","呢","就","是","这","一","轮","全","球","金","融","动","荡","的","表","现"]}
----
./icefall-asr-zipformer-wenetspeech-20230615/test_wavs/DEV_T0000000002.wav
{"text":"深入地分析这一次全球金融动荡荡背后的根源","timestamps":"[0.00, 0.12, 0.48, 0.84, 1.08, 1.44, 1.60, 1.84, 2.24, 2.48, 2.76, 2.88, 3.12, 3.24, 3.28, 3.36, 3.60, 3.72, 3.84, 4.16]","tokens":["深","入","地","分","析","这","一","次","全","球","金","融","动","荡","荡","背","后","的","根","源"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.338 s
Real time factor (RTF): 0.338 / 15.289 = 0.022

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./icefall-asr-zipformer-wenetspeech-20230615/data/lang_char/tokens.txt \
  --encoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/encoder-epoch-12-avg-4.onnx \
  --decoder=./icefall-asr-zipformer-wenetspeech-20230615/exp/decoder-epoch-12-avg-4.onnx \
  --joiner=./icefall-asr-zipformer-wenetspeech-20230615/exp/joiner-epoch-12-avg-4.onnx

csukuangfj/sherpa-onnx-zipformer-large-en-2023-06-26 (English)

This model is converted from

https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-2023-05-16

which supports only English as it is trained on the LibriSpeech corpus.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2

tar xvf sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2
rm sherpa-onnx-zipformer-large-en-2023-06-26.tar.bz2

Please check that the file sizes of the pre-trained models are correct. The expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-large-en-2023-06-26 fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff   1.2M Jun 26 13:19 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Jun 26 13:19 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   145M Jun 26 13:20 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   564M Jun 26 13:22 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   253K Jun 26 13:19 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Jun 26 13:19 joiner-epoch-99-avg-1.onnx
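
Hint

Note how much int8 quantization helps for this large model: the encoder shrinks from 564M to 145M (roughly a quarter of the size), while the decoding results shown below are nearly identical to the fp32 ones.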

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.48, 0.60, 0.72, 1.04, 1.28, 1.36, 1.48, 1.60, 1.84, 1.96, 2.00, 2.16, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40, 3.56, 3.76, 4.04, 4.24, 4.28, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.60, 5.76, 5.96, 6.16]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.20, 0.48, 0.72, 0.88, 1.04, 1.12, 1.20, 1.36, 1.52, 1.68, 1.84, 1.88, 2.00, 2.12, 2.32, 2.36, 2.60, 2.84, 3.12, 3.24, 3.48, 3.56, 3.76, 3.92, 4.12, 4.36, 4.56, 4.72, 4.96, 5.16, 5.44, 5.68, 6.12, 6.28, 6.48, 6.88, 7.12, 7.36, 7.56, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92, 10.00, 10.12, 10.48, 10.68, 10.76, 11.00, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.64, 13.76, 14.00, 14.12, 14.24, 14.36, 14.52, 14.72, 14.80, 15.04, 15.28, 15.52, 15.76, 16.00, 16.20, 16.24, 16.32]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.12, 0.36, 0.48, 0.76, 0.96, 1.12, 1.24, 1.32, 1.44, 1.48, 1.68, 1.76, 1.88, 2.04, 2.12, 2.24, 2.28, 2.48, 2.56, 2.80, 3.08, 3.28, 3.52, 3.80, 3.92, 4.00, 4.16, 4.24, 4.36, 4.44]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.843 s
Real time factor (RTF): 1.843 / 28.165 = 0.065

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.48, 0.60, 0.72, 1.04, 1.28, 1.36, 1.48, 1.60, 1.84, 1.96, 2.00, 2.16, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40, 3.56, 3.76, 4.04, 4.24, 4.28, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.60, 5.76, 5.96, 6.16]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.20, 0.48, 0.72, 0.88, 1.04, 1.12, 1.20, 1.36, 1.52, 1.64, 1.84, 1.88, 2.00, 2.12, 2.32, 2.36, 2.60, 2.84, 3.12, 3.24, 3.48, 3.56, 3.76, 3.92, 4.12, 4.36, 4.52, 4.72, 4.96, 5.16, 5.44, 5.68, 6.12, 6.28, 6.48, 6.88, 7.12, 7.36, 7.56, 7.92, 8.16, 8.28, 8.40, 8.48, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92, 10.00, 10.12, 10.48, 10.68, 10.76, 11.00, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.64, 13.76, 14.00, 14.08, 14.24, 14.36, 14.52, 14.72, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.20, 16.24, 16.32]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-large-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.12, 0.36, 0.48, 0.76, 0.96, 1.12, 1.24, 1.32, 1.44, 1.48, 1.68, 1.76, 1.88, 2.04, 2.12, 2.28, 2.32, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.76, 3.92, 4.00, 4.16, 4.24, 4.36, 4.44]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.490 s
Real time factor (RTF): 1.490 / 28.165 = 0.053

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-large-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-large-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-large-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-large-en-2023-06-26/joiner-epoch-99-avg-1.onnx

csukuangfj/sherpa-onnx-zipformer-small-en-2023-06-26 (English)

This model is converted from

https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-small-2023-05-16

which supports only English as it is trained on the LibriSpeech corpus.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-small-en-2023-06-26.tar.bz2

tar xvf sherpa-onnx-zipformer-small-en-2023-06-26.tar.bz2
rm sherpa-onnx-zipformer-small-en-2023-06-26.tar.bz2
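
If wget is not available on your system, the same archive can be fetched with curl before extracting it. This is only a sketch of an alternative (the commands above use wget) and assumes curl is installed:

# -L follows the GitHub release redirect; -O keeps the remote file name
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-small-en-2023-06-26.tar.bz2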

Please check that the file sizes of the downloaded pre-trained models are correct; the expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-small-en-2023-06-26 fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff   1.2M Jun 26 13:04 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Jun 26 13:04 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff    25M Jun 26 13:04 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff    87M Jun 26 13:04 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   253K Jun 26 13:04 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Jun 26 13:04 joiner-epoch-99-avg-1.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.
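
If your own recordings do not meet these requirements (for example, stereo or 24-bit files), you can convert them first. Below is a minimal sketch using sox, which is an assumption here; input.wav and mono-16bit.wav are placeholder file names:

# Produce a single-channel, 16-bit WAV; the original sampling rate is kept,
# since sherpa-onnx-offline resamples internally when needed
sox input.wav -c 1 -b 16 mono-16bit.wav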

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
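
Since sherpa-onnx-offline accepts multiple wave files on the command line (three are passed above), you can also decode a whole directory with a shell glob. A minimal sketch, where my-wavs/ is a hypothetical directory of 16-bit mono WAV files:

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  ./my-wavs/*.wav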

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.12, 1.36, 1.44, 1.56, 1.72, 1.84, 1.96, 2.04, 2.20, 2.32, 2.36, 2.44, 2.60, 2.76, 3.04, 3.24, 3.40, 3.52, 3.72, 4.04, 4.20, 4.28, 4.48, 4.64, 4.80, 4.84, 4.96, 5.00, 5.28, 5.40, 5.52, 5.60, 5.76, 5.92, 6.08]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.32, 0.64, 0.80, 0.96, 1.08, 1.16, 1.20, 1.32, 1.52, 1.68, 1.80, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.16, 3.20, 3.44, 3.52, 3.72, 3.88, 4.16, 4.44, 4.60, 4.76, 4.96, 5.16, 5.36, 5.60, 6.16, 6.32, 6.52, 6.88, 7.16, 7.32, 7.60, 7.96, 8.16, 8.28, 8.36, 8.48, 8.64, 8.76, 8.84, 9.04, 9.28, 9.44, 9.52, 9.60, 9.68, 9.88, 9.92, 10.12, 10.52, 10.76, 10.80, 11.08, 11.20, 11.36, 11.56, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.68, 12.80, 12.92, 13.00, 13.20, 13.48, 13.72, 13.84, 14.04, 14.20, 14.28, 14.40, 14.56, 14.68, 14.76, 15.00, 15.24, 15.48, 15.68, 15.92, 16.08, 16.12, 16.20]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.32, 0.48, 0.64, 0.84, 1.08, 1.20, 1.32, 1.36, 1.44, 1.48, 1.64, 1.76, 1.88, 2.08, 2.12, 2.24, 2.28, 2.44, 2.48, 2.80, 3.04, 3.24, 3.48, 3.72, 3.88, 3.92, 4.08, 4.16, 4.24, 4.36]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.953 s
Real time factor (RTF): 0.953 / 28.165 = 0.034

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.64, 0.76, 0.84, 1.08, 1.36, 1.44, 1.56, 1.72, 1.84, 1.96, 2.04, 2.20, 2.32, 2.36, 2.44, 2.60, 2.76, 3.04, 3.24, 3.40, 3.52, 3.72, 4.00, 4.20, 4.28, 4.48, 4.64, 4.80, 4.84, 4.96, 5.00, 5.28, 5.40, 5.52, 5.60, 5.76, 5.92, 6.08]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.32, 0.64, 0.80, 0.96, 1.08, 1.16, 1.20, 1.32, 1.52, 1.68, 1.80, 1.88, 2.04, 2.16, 2.32, 2.40, 2.64, 2.88, 3.16, 3.20, 3.44, 3.52, 3.72, 3.88, 4.16, 4.44, 4.60, 4.76, 4.96, 5.16, 5.36, 5.60, 6.16, 6.32, 6.52, 6.88, 7.16, 7.32, 7.60, 7.96, 8.16, 8.28, 8.36, 8.48, 8.64, 8.76, 8.84, 9.04, 9.28, 9.44, 9.52, 9.60, 9.68, 9.88, 9.92, 10.12, 10.52, 10.76, 10.80, 11.08, 11.20, 11.36, 11.56, 11.76, 11.96, 12.08, 12.24, 12.28, 12.48, 12.68, 12.80, 12.92, 13.04, 13.16, 13.48, 13.72, 13.84, 14.04, 14.20, 14.28, 14.40, 14.56, 14.68, 14.76, 15.00, 15.28, 15.48, 15.68, 15.92, 16.08, 16.12, 16.20]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-small-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.32, 0.48, 0.64, 0.84, 1.08, 1.20, 1.32, 1.36, 1.44, 1.48, 1.64, 1.76, 1.88, 2.08, 2.12, 2.24, 2.28, 2.44, 2.48, 2.80, 3.04, 3.24, 3.48, 3.72, 3.88, 3.92, 4.08, 4.16, 4.24, 4.36]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 0.891 s
Real time factor (RTF): 0.891 / 28.165 = 0.032
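
If you want to compare the fp32 and int8 variants on your own machine, a small shell loop can run both back to back. The sketch below is only an illustration that follows the pairing used above (int8 encoder and joiner together with the fp32 decoder):

dir=./sherpa-onnx-zipformer-small-en-2023-06-26

# The suffix "onnx" selects the fp32 encoder/joiner; "int8.onnx" selects the int8 ones
for suffix in onnx int8.onnx; do
  echo "=== encoder/joiner: $suffix ==="
  ./build/bin/sherpa-onnx-offline \
    --tokens=$dir/tokens.txt \
    --encoder=$dir/encoder-epoch-99-avg-1.$suffix \
    --decoder=$dir/decoder-epoch-99-avg-1.onnx \
    --joiner=$dir/joiner-epoch-99-avg-1.$suffix \
    $dir/test_wavs/0.wav
done

Each run prints its own "Elapsed seconds" and "Real time factor (RTF)" lines, so no extra timing tool is needed.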

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx
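
If you prefer to record a short clip and then decode it offline instead, any recorder that produces a 16-bit mono WAV will do. A minimal sketch using sox's rec command (an assumption; my-clip.wav is a placeholder):

# Record 5 seconds of 16-bit, 16 kHz mono audio
rec -c 1 -b 16 -r 16000 my-clip.wav trim 0 5

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-small-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-small-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-small-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-small-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  my-clip.wav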

csukuangfj/sherpa-onnx-zipformer-en-2023-06-26 (English)

This model is converted from

https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15

which supports only English as it is trained on the LibriSpeech corpus.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-06-26.tar.bz2

tar xvf sherpa-onnx-zipformer-en-2023-06-26.tar.bz2
rm sherpa-onnx-zipformer-en-2023-06-26.tar.bz2

Please check that the file sizes of the downloaded pre-trained models are correct; the expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-en-2023-06-26 fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff   1.2M Jun 26 12:45 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M Jun 26 12:45 decoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff    66M Jun 26 12:45 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   248M Jun 26 12:46 encoder-epoch-99-avg-1.onnx
-rw-r--r--  1 fangjun  staff   253K Jun 26 12:45 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M Jun 26 12:45 joiner-epoch-99-avg-1.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.
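
To verify that a file meets these requirements before decoding, you can inspect its header first. A minimal sketch using soxi, which ships with sox (an assumption; any wave-inspection tool works):

# Prints the channel count, sample rate and sample precision of the wave file
soxi ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav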

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.56, 0.64, 0.80, 1.08, 1.36, 1.40, 1.52, 1.68, 1.84, 1.96, 2.04, 2.20, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40, 3.56, 3.76, 4.08, 4.24, 4.32, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.60, 5.76, 5.96, 6.12]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.24, 0.56, 0.76, 0.92, 1.04, 1.16, 1.20, 1.36, 1.52, 1.64, 1.80, 1.88, 2.00, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.48, 3.56, 3.72, 3.92, 4.12, 4.40, 4.52, 4.72, 4.96, 5.16, 5.36, 5.64, 6.12, 6.28, 6.52, 6.88, 7.12, 7.32, 7.56, 7.92, 8.16, 8.28, 8.40, 8.48, 8.64, 8.76, 8.88, 9.04, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92, 9.96, 10.16, 10.48, 10.72, 10.80, 11.04, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.68, 13.84, 14.00, 14.16, 14.28, 14.40, 14.56, 14.72, 14.76, 15.00, 15.28, 15.48, 15.68, 15.96, 16.16, 16.20, 16.28]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.24, 0.40, 0.60, 0.80, 1.04, 1.16, 1.28, 1.36, 1.44, 1.48, 1.68, 1.76, 1.88, 2.00, 2.12, 2.24, 2.28, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.68, 3.84, 3.96, 4.12, 4.20, 4.32, 4.44]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.301 s
Real time factor (RTF): 1.301 / 28.165 = 0.046

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav ./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4, context_score=1.5)
Creating recognizer ...
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:108 Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/0.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00, 0.56, 0.64, 0.80, 1.08, 1.36, 1.40, 1.52, 1.68, 1.84, 1.96, 2.04, 2.20, 2.32, 2.40, 2.48, 2.60, 2.76, 3.04, 3.28, 3.40, 3.56, 3.76, 4.08, 4.24, 4.32, 4.48, 4.64, 4.80, 4.84, 5.00, 5.04, 5.28, 5.40, 5.56, 5.60, 5.76, 5.96, 6.12]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/1.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00, 0.24, 0.56, 0.76, 0.92, 1.04, 1.16, 1.20, 1.36, 1.52, 1.64, 1.80, 1.88, 2.00, 2.16, 2.32, 2.40, 2.64, 2.88, 3.12, 3.24, 3.48, 3.56, 3.72, 3.92, 4.12, 4.40, 4.52, 4.72, 4.96, 5.12, 5.40, 5.64, 6.12, 6.28, 6.52, 6.88, 7.12, 7.32, 7.60, 7.92, 8.16, 8.28, 8.40, 8.48, 8.64, 8.76, 8.88, 9.04, 9.28, 9.44, 9.52, 9.60, 9.72, 9.92, 9.96, 10.16, 10.48, 10.72, 10.80, 11.04, 11.20, 11.36, 11.56, 11.76, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.92, 13.04, 13.20, 13.44, 13.68, 13.84, 14.00, 14.16, 14.28, 14.40, 14.56, 14.72, 14.76, 15.00, 15.28, 15.48, 15.68, 15.96, 16.16, 16.20, 16.28]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./sherpa-onnx-zipformer-en-2023-06-26/test_wavs/8k.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00, 0.24, 0.40, 0.60, 0.80, 1.04, 1.16, 1.28, 1.36, 1.44, 1.48, 1.68, 1.76, 1.88, 2.00, 2.08, 2.24, 2.28, 2.48, 2.52, 2.80, 3.08, 3.28, 3.52, 3.68, 3.84, 3.96, 4.12, 4.20, 4.32, 4.44]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.106 s
Real time factor (RTF): 1.106 / 28.165 = 0.039

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-06-26/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-06-26/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-06-26/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-06-26/joiner-epoch-99-avg-1.onnx

icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04 (English)

This model is trained on GigaSpeech, LibriSpeech, and Common Voice 13.0 using Zipformer.

See https://github.com/k2-fsa/icefall/pull/1010 if you are interested in how it is trained.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04.tar.bz2

tar xvf icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04.tar.bz2
rm icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04.tar.bz2

Please check that the file sizes of the downloaded pre-trained models are correct; the expected sizes of the *.onnx files are listed below.

$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff   1.2M May 15 11:11 decoder-epoch-30-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff   2.0M May 15 11:11 decoder-epoch-30-avg-4.onnx
-rw-r--r--  1 fangjun  staff   121M May 15 11:12 encoder-epoch-30-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff   279M May 15 11:13 encoder-epoch-30-avg-4.onnx
-rw-r--r--  1 fangjun  staff   253K May 15 11:11 joiner-epoch-30-avg-4.int8.onnx
-rw-r--r--  1 fangjun  staff   1.0M May 15 11:11 joiner-epoch-30-avg-4.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.onnx \
  --decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx \
  --joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.onnx \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav
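
Note that this archive keeps tokens.txt under data/lang_bpe_500 and the ONNX models under exp, unlike the models above. If you are unsure where a file lives after extraction, you can list the relevant ones; a minimal sketch:

find ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04 \
  -name "*.onnx" -o -name tokens.txt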

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt --encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.onnx --decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx --joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.onnx ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.onnx", decoder_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx", joiner_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4)
Creating recognizer ...
Started
Done!

./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00,0.40,0.56,0.64,0.96,1.24,1.32,1.44,1.56,1.76,1.88,1.96,2.16,2.32,2.36,2.48,2.60,2.80,3.08,3.28,3.36,3.56,3.80,4.04,4.24,4.32,4.48,4.64,4.84,4.88,5.00,5.08,5.32,5.44,5.56,5.64,5.80,5.96,6.20]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00,0.16,0.44,0.68,0.84,1.00,1.12,1.16,1.32,1.48,1.64,1.80,1.84,2.00,2.12,2.28,2.40,2.64,2.88,3.16,3.28,3.56,3.60,3.76,3.92,4.12,4.36,4.52,4.72,4.92,5.16,5.44,5.72,6.12,6.24,6.48,6.84,7.08,7.28,7.56,7.88,8.12,8.28,8.36,8.48,8.60,8.76,8.88,9.12,9.28,9.48,9.56,9.64,9.80,10.00,10.04,10.20,10.44,10.68,10.80,11.04,11.20,11.40,11.56,11.80,12.00,12.12,12.28,12.32,12.52,12.72,12.84,12.96,13.04,13.24,13.40,13.64,13.80,14.00,14.16,14.24,14.36,14.56,14.72,14.80,15.08,15.32,15.52,15.76,16.04,16.16,16.24,16.36]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00,0.08,0.32,0.48,0.68,0.92,1.08,1.20,1.28,1.40,1.44,1.64,1.76,1.88,2.04,2.12,2.24,2.32,2.48,2.56,2.88,3.12,3.32,3.52,3.76,3.92,4.00,4.20,4.28,4.40,4.52]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.662 s
Real time factor (RTF): 1.662 / 28.165 = 0.059

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.int8.onnx \
  --decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx \
  --joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.int8.onnx \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav \
  ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt --encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.int8.onnx --decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx --joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.int8.onnx ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav ./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav 

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.int8.onnx", decoder_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx", joiner_filename="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), tokens="./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt", num_threads=2, debug=False, provider="cpu"), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="greedy_search", max_active_paths=4)
Creating recognizer ...
Started
Done!

./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1089-134686-0001.wav
{"text":" AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[0.00,0.40,0.56,0.64,0.96,1.24,1.32,1.44,1.56,1.76,1.88,1.96,2.16,2.32,2.36,2.48,2.60,2.80,3.08,3.28,3.36,3.56,3.80,4.04,4.24,4.32,4.48,4.64,4.84,4.88,5.00,5.08,5.32,5.44,5.56,5.64,5.80,5.96,6.20]","tokens":[" AFTER"," E","AR","LY"," NIGHT","F","A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0001.wav
{"text":" GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[0.00,0.12,0.44,0.68,0.80,1.00,1.12,1.16,1.32,1.48,1.64,1.80,1.84,2.00,2.12,2.28,2.40,2.64,2.88,3.16,3.28,3.56,3.60,3.76,3.92,4.12,4.36,4.52,4.72,4.92,5.16,5.44,5.72,6.12,6.24,6.48,6.84,7.08,7.28,7.56,7.88,8.12,8.28,8.36,8.48,8.60,8.76,8.88,9.12,9.28,9.48,9.56,9.64,9.80,10.00,10.04,10.16,10.44,10.68,10.80,11.04,11.20,11.40,11.56,11.80,12.00,12.16,12.28,12.32,12.52,12.72,12.84,12.96,13.04,13.24,13.40,13.64,13.80,14.00,14.16,14.24,14.36,14.56,14.72,14.80,15.08,15.32,15.52,15.76,16.04,16.16,16.24,16.36]","tokens":[" GO","D"," AS"," A"," DI","RE","C","T"," CON","SE","QUE","N","CE"," OF"," THE"," S","IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/test_wavs/1221-135766-0002.wav
{"text":" YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION","timestamps":"[0.00,0.08,0.32,0.48,0.68,0.92,1.08,1.20,1.28,1.40,1.44,1.64,1.76,1.88,2.04,2.12,2.28,2.32,2.52,2.56,2.88,3.12,3.32,3.52,3.76,3.92,4.00,4.20,4.28,4.40,4.52]","tokens":[" YE","T"," THE","SE"," THOUGHT","S"," A","FF","E","C","TED"," HE","S","TER"," P","RY","N","NE"," ","LESS"," WITH"," HO","PE"," THAN"," A","PP","RE","HE","N","S","ION"]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.424 s
Real time factor (RTF): 1.424 / 28.165 = 0.051

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/data/lang_bpe_500/tokens.txt \
  --encoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/encoder-epoch-30-avg-4.onnx \
  --decoder=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/decoder-epoch-30-avg-4.onnx \
  --joiner=./icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04/exp/joiner-epoch-30-avg-4.onnx

csukuangfj/sherpa-onnx-zipformer-en-2023-04-01 (English)

This model is converted from

https://huggingface.co/WeijiZhuang/icefall-asr-librispeech-pruned-transducer-stateless8-2022-12-02

which supports only English as it is trained on the LibriSpeech and GigaSpeech corpora.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless8

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-04-01.tar.bz2

tar xvf sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
rm sherpa-onnx-zipformer-en-2023-04-01.tar.bz2

Please check that the file sizes of the downloaded pre-trained models are correct; the expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-en-2023-04-01$ ls -lh *.onnx
-rw-r--r-- 1 kuangfangjun root  1.3M Apr  1 14:34 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  2.0M Apr  1 14:34 decoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  180M Apr  1 14:34 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  338M Apr  1 14:34 encoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  254K Apr  1 14:34 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root 1003K Apr  1 14:34 joiner-epoch-99-avg-1.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.
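
If your source audio is not a WAV file at all (for example, an MP3), ffmpeg can produce a compatible file. A minimal sketch, assuming ffmpeg is installed; input.mp3 and input-16k.wav are placeholder file names:

# -ac 1: mono, -acodec pcm_s16le: 16-bit samples, -ar 16000: 16 kHz sampling rate
ffmpeg -i input.mp3 -ac 1 -acodec pcm_s16le -ar 16000 input-16k.wav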

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
Creating recognizer ...
2023-04-01 14:40:56.353883875 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 638155, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-04-01 14:40:56.353881478 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 638154, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
Started
Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 2.151 s
Real time factor (RTF): 2.151 / 28.165 = 0.076

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
Creating recognizer ...
2023-04-01 14:42:00.407939001 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 638195, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-04-01 14:42:00.407940827 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 638196, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
Started
Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/0.wav
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/1.wav
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-04-01/test_wavs/8k.wav
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.478 s
Real time factor (RTF): 1.478 / 28.165 = 0.052

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-04-01/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-04-01/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-04-01/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-04-01/joiner-epoch-99-avg-1.onnx
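
If you are unsure which microphone your system exposes, you can list the capture devices at the operating-system level first. A minimal sketch for Linux with ALSA (an assumption; macOS and Windows have their own tools):

# Lists the available ALSA capture devices (requires alsa-utils)
arecord -l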

csukuangfj/sherpa-onnx-zipformer-en-2023-03-30 (English)

This model is converted from

https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11

which supports only English as it is trained on the LibriSpeech corpus.

You can find the training code at

https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-03-30.tar.bz2

tar xvf sherpa-onnx-zipformer-en-2023-03-30.tar.bz2
rm sherpa-onnx-zipformer-en-2023-03-30.tar.bz2

Please check that the file sizes of the downloaded pre-trained models are correct; the expected sizes of the *.onnx files are listed below.

sherpa-onnx-zipformer-en-2023-03-30$ ls -lh *.onnx
-rw-r--r-- 1 kuangfangjun root  1.3M Mar 31 00:37 decoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  2.0M Mar 30 20:10 decoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  180M Mar 31 00:37 encoder-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root  338M Mar 30 20:10 encoder-epoch-99-avg-1.onnx
-rw-r--r-- 1 kuangfangjun root  254K Mar 31 00:37 joiner-epoch-99-avg-1.int8.onnx
-rw-r--r-- 1 kuangfangjun root 1003K Mar 30 20:10 joiner-epoch-99-avg-1.onnx

Decode wave files

Hint

It supports decoding only wave files of a single channel with 16-bit encoded samples, while the sampling rate does not need to be 16 kHz.

fp32

The following code shows how to use fp32 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
Creating recognizer ...
2023-04-01 06:47:56.620698024 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 607690, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-04-01 06:47:56.620700026 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 607691, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
Started
Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.950 s
Real time factor (RTF): 1.950 / 28.165 = 0.069

int8

The following code shows how to use int8 models to decode wave files:

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.int8.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.int8.onnx \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav \
  ./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav

Note

Please use ./build/bin/Release/sherpa-onnx-offline.exe for Windows.

You should see the following output:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.int8.onnx", decoder_filename="./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx", joiner_filename="./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.int8.onnx"), paraformer=OfflineParaformerModelConfig(model=""), tokens="./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt", num_threads=2, debug=False), decoding_method="greedy_search")
Creating recognizer ...
2023-04-01 06:49:34.370117205 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 607732, index: 16, mask: {17, 53, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2023-04-01 06:49:34.370115197 [E:onnxruntime:, env.cc:251 ThreadMain] pthread_setaffinity_np failed for thread: 607731, index: 15, mask: {16, 52, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
Started
Creating a resampler:
   in_sample_rate: 8000
   output_sample_rate: 16000

Done!

./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/0.wav
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/1.wav
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----
./sherpa-onnx-zipformer-en-2023-03-30/test_wavs/8k.wav
 YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.710 s
Real time factor (RTF): 1.710 / 28.165 = 0.061

Speech recognition from a microphone

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-zipformer-en-2023-03-30/tokens.txt \
  --encoder=./sherpa-onnx-zipformer-en-2023-03-30/encoder-epoch-99-avg-1.onnx \
  --decoder=./sherpa-onnx-zipformer-en-2023-03-30/decoder-epoch-99-avg-1.onnx \
  --joiner=./sherpa-onnx-zipformer-en-2023-03-30/joiner-epoch-99-avg-1.onnx