Pre-trained Models

This page describes how to download pre-trained SenseVoice models.

sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

This model is converted from https://www.modelscope.cn/models/iic/SenseVoiceSmall using the script export-onnx.py.

It supports the following 5 languages:

Chinese (Mandarin, 普通话)

Cantonese (粤语, 广东话)

English

Japanese

Korean

In the following, we describe how to download it.

Download

Please use the following commands to download it:

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2

tar xvf sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
rm sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2

After downloading, you should find the following files:

ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

total 1.1G
-rw-r--r-- 1 runner docker   71 Jul 18 13:06 LICENSE
-rw-r--r-- 1 runner docker  104 Jul 18 13:06 README.md
-rwxr-xr-x 1 runner docker 5.8K Jul 18 13:06 export-onnx.py
-rw-r--r-- 1 runner docker 229M Jul 18 13:06 model.int8.onnx
-rw-r--r-- 1 runner docker 895M Jul 18 13:06 model.onnx
drwxr-xr-x 2 runner docker 4.0K Jul 18 13:06 test_wavs
-rw-r--r-- 1 runner docker 309K Jul 18 13:06 tokens.txt

ls -lh sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs

total 940K
-rw-r--r-- 1 runner docker 224K Jul 18 13:06 en.wav
-rw-r--r-- 1 runner docker 226K Jul 18 13:06 ja.wav
-rw-r--r-- 1 runner docker 145K Jul 18 13:06 ko.wav
-rw-r--r-- 1 runner docker 161K Jul 18 13:06 yue.wav
-rw-r--r-- 1 runner docker 175K Jul 18 13:06 zh.wav

Decode a file with model.onnx

Without inverse text normalization

To decode a file without inverse text normalization, please use:

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
  --num-threads=1 \
  --debug=0 \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx --num-threads=1 --debug=0 ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx", language="auto", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "开饭时间早上九点至下午五点", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90, 4.20, 4.56, 4.74], "tokens":["开", "饭", "时", "间", "早", "上", "九", "点", "至", "下", "午", "五", "点"], "words": []}
----
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
{"text": "the tribal chieftain called for the boy and presented him with fifty pieces of gold", "timestamps": [0.90, 1.26, 1.56, 1.80, 2.16, 2.46, 2.76, 2.94, 3.12, 3.60, 3.96, 4.50, 4.74, 5.10, 5.52, 5.88, 6.18], "tokens":["the", " tri", "bal", " chief", "tain", " called", " for", " the", " boy", " and", " presented", " him", " with", " fifty", " pieces", " of", " gold"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 2.320 s
Real time factor (RTF): 2.320 / 12.744 = 0.182

With inverse text normalization

To decode a file with inverse text normalization, please use:

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
  --num-threads=1 \
  --sense-voice-use-itn=1 \
  --debug=0 \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx --num-threads=1 --sense-voice-use-itn=1 --debug=0 ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx", language="auto", use_itn=True), telespeech_ctc="", tokens="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "开放时间早上9点至下午5点。", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90, 4.20, 4.56, 4.74, 5.46], "tokens":["开", "放", "时", "间", "早", "上", "9", "点", "至", "下", "午", "5", "点", "。"], "words": []}
----
./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/en.wav
{"text": "The tribal chieftain called for the boy and presented him with 50 pieces of gold.", "timestamps": [0.90, 1.26, 1.56, 1.80, 2.16, 2.46, 2.76, 2.94, 3.12, 3.60, 3.96, 4.50, 4.74, 4.92, 5.10, 5.28, 5.52, 5.88, 6.18, 7.02], "tokens":["The", " tri", "bal", " chief", "tain", " called", " for", " the", " boy", " and", " presented", " him", " with", " ", "5", "0", " pieces", " of", " gold", "."], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 1.543 s
Real time factor (RTF): 1.543 / 12.744 = 0.121

Hint

When inverse text normalziation is enabled, the results also punctuations.

Specify a language

If you don’t provide a language when decoding, it uses auto.

To specify the language when decoding, please use:

./build/bin/sherpa-onnx-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx \
  --num-threads=1 \
  --sense-voice-language=zh \
  --debug=0 \
  ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx --num-threads=1 --sense-voice-language=zh --debug=0 ./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.onnx", language="zh", use_itn=False), telespeech_ctc="", tokens="./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="")
Creating recognizer ...
Started
Done!

./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/test_wavs/zh.wav
{"text": "开饭时间早上九点至下午五点", "timestamps": [0.72, 0.96, 1.26, 1.44, 1.92, 2.10, 2.58, 2.82, 3.30, 3.90, 4.20, 4.56, 4.74], "tokens":["开", "饭", "时", "间", "早", "上", "九", "点", "至", "下", "午", "五", "点"], "words": []}
----
num threads: 1
decoding method: greedy_search
Elapsed seconds: 0.625 s
Real time factor (RTF): 0.625 / 5.592 = 0.112

Hint

Valid values for --sense-voice-language are auto, zh, en, ko, ja, and yue. where zh is for Chinese, en for English, ko for Korean, ja for Japanese, and yue for Cantonese.

Speech recognition from a microphone

./build/bin/sherpa-onnx-microphone-offline \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx

Speech recognition from a microphone with VAD

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx

./build/bin/sherpa-onnx-vad-microphone-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --tokens=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  --sense-voice-model=./sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx