Pre-trained Models

This page describes how to download pre-trained Qwen3-ASR models.

sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25

This model is converted from Qwen3-ASR using scripts from https://github.com/Wasser1462/Qwen3-ASR-onnx.

It supports the following languages:

  • Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de)

  • French (fr), Spanish (es), Portuguese (pt), Indonesian (id)

  • Italian (it), Korean (ko), Russian (ru), Thai (th)

  • Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi)

  • Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi)

  • Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el)

  • Hungarian (hu), Macedonian (mk), Romanian (ro)

It also supports the following Chinese dialects:

  • Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan

  • Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan

  • Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent)

  • Wu (Wuyu), Minnan (Southern Min).

Hint

In addition to the languages and Chinese dialects listed above, this model also supports lyrics recognition and rap speech recognition.

The sections below show how to use it.

Download

Please use the following commands to download it:

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25.tar.bz2
tar xvf sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25.tar.bz2
rm sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25.tar.bz2

After downloading, you should find the following files:

ls -lh sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25

total 1954552
-rw-r--r--@  1 fangjun  staff    42M  7 Apr 17:45 conv_frontend.onnx
-rw-r--r--@  1 fangjun  staff   721M  7 Apr 17:50 decoder.int8.onnx
-rw-r--r--@  1 fangjun  staff   174M  7 Apr 17:50 encoder.int8.onnx
-rw-r--r--@  1 fangjun  staff   328B  7 Apr 17:51 README.md
drwxr-xr-x@ 19 fangjun  staff   608B  7 Apr 17:45 test_wavs
drwxr-xr-x@  5 fangjun  staff   160B  7 Apr 17:51 tokenizer
ls -lh sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer

total 8728
-rw-r--r--@ 1 fangjun  staff   1.6M  7 Apr 17:50 merges.txt
-rw-r--r--@ 1 fangjun  staff    12K  7 Apr 17:51 tokenizer_config.json
-rw-r--r--@ 1 fangjun  staff   2.6M  7 Apr 17:51 vocab.json
ls -lh sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs

total 25624
-rw-r--r--@ 1 fangjun  staff   164K  7 Apr 17:45 ar1.wav
-rw-r--r--@ 1 fangjun  staff   514K  7 Apr 17:45 cantonese.wav
-rw-r--r--@ 1 fangjun  staff   537K  7 Apr 17:45 codeswitch.wav
-rw-r--r--@ 1 fangjun  staff   210K  7 Apr 17:45 de.wav
-rw-r--r--@ 1 fangjun  staff   161K  7 Apr 17:45 es1.wav
-rw-r--r--@ 1 fangjun  staff   1.6M  7 Apr 17:45 f1_noise.wav
-rw-r--r--@ 1 fangjun  staff   980K  7 Apr 17:45 fast1.wav
-rw-r--r--@ 1 fangjun  staff   187K  7 Apr 17:45 fr1.wav
-rw-r--r--@ 1 fangjun  staff   438K  7 Apr 17:45 ja1.wav
-rw-r--r--@ 1 fangjun  staff   2.7M  7 Apr 17:45 noise1-en.wav
-rw-r--r--@ 1 fangjun  staff   724K  7 Apr 17:45 noise2.wav
-rw-r--r--@ 1 fangjun  staff   1.6M  7 Apr 17:45 qiqiu1.wav
-rw-r--r--@ 1 fangjun  staff   1.7M  7 Apr 17:45 raokouling.wav
-rw-r--r--@ 1 fangjun  staff   914K  7 Apr 17:45 rap1.wav
-rw-r--r--@ 1 fangjun  staff    76B  7 Apr 17:45 README.md
-rw-r--r--@ 1 fangjun  staff   149K  7 Apr 17:45 ru1.wav
-rw-r--r--@ 1 fangjun  staff   5.3K  7 Apr 17:45 transcript.txt
cat sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/transcript.txt

fast1.wav 蹦出来之后,左手、右手接一个慢动作,右边再直接拉到这上面之后,直接拉到这个轮胎上,上边再接过去之后,然后上边再直接拉到这个位置了之后,右边再直接这个位置接倒过去的之后,再倒一下,然后右边再直接抓住这个上边了之后,直接从这边上边过去了之后,直接抓住这个树杈,然后这个位置直接倒到这个树杈
raokouling.wav 广西壮族自治区爱吃红鲤鱼与绿鲤鱼与驴的出租车司机,拉着苗族土家族自治州爱喝自制的刘奶奶榴莲牛奶的骨质疏松症患者,遇见别着喇叭的哑巴,打败咬死山前四十四棵紫色柿子树的四十四只石狮子之后,碰到年年恋牛娘的牛郎,念着灰黑灰化肥发黑会挥发,走出香港官方网站设置组,到广西壮族自治区首府南宁市民族医院就医。
noise2.wav 拨号,请再说一次,请说出您要拨打的号码。幺三五八幺八八七五七。一三五八二八八八幺八八。纠正纠正。九六九。纠正纠正,不是九六。
qiqiu1.wav 黑的白的红的粉的紫的绿的蓝的灰的,你的我的他的她的大的小的圆的扁的,好的坏的美的丑的新的旧的,各种款式各种花式,让我选择。飞得高喽,越远越好,天都沦陷,他就死掉,说明多能高兴就好,喜欢就好,没大不了,越变越小,越来越小,快要死掉也很骄傲。你不想说就别再说,我不想听不想再听,就把一切誓言当作气球一般随它而去,我不在意不会在意,随它而去随它而去。气球飘进眼里,飘进风里,结束生命。气球飘进爱里,飘进心里,慢慢死去。
cantonese.wav 今次寻寻觅觅,终于揾到my princess,肯借个场俾我哋玩。你知啦,喺香港地喺繁忙时间要揾个场嚟拍嘢系非常之难嘅。再一次多谢你哋,亦都好多谢片入边嘅每一个人。下一次我哋斗啲咩好?喺下面留言话我哋知啦。拜拜。
f1_noise.wav Okay, Charles. It looks like we have a problem with the radio. What happened? Yeah, someone spilled water on their machine. I uh, yeah. Charles, can you hear us? Mamma mia.
noise1-en.wav My girls, my girls, my girls, my girls. Ready? Hey, babe. Hey, babe. Where are you? I'm actually crazy traffic right now. Oh, really? Yeah. It's crazy. The freeway is completely stopped. Oh, you're still coming to my parents' house, right? Um, I can't really hear you, babe. What? Mariachi band playing live music, yeah. Babe. I can't. They're really loud, and I can't hear. Babe. What? Yeah, they're being really loud right now. I'm sorry. What were you saying? You're still coming to my parents' house, right? It actually started raining like crazy. What? It's raining and thundering like crazy, man. I can't hear shit. Where are you? Out of nowhere, it's just pouring. It's pouring. It's pouring like crazy. What? Insane. I don't think I can get anywhere today. It's crazy day. Babe, that's crazy. Where are you? Oh my God! Someone just hit my car. Come on, get in the car, Gabe. Oh my God! He's getting out of his car. He's getting out of his car. What? Hey, Mari, you just hit my car! Oh, babe, this guy's crazy. What the fuck? Out, out, out, out, out! How you like that? Oh, babe, he's beating the shit out of you. Hold on a sec, babe. Oh, you know I'm gonna get my gun. Here's a gun. Oh, babe, he shot me. He shot me in the leg. He shot. Oh, babe, oh fuck! Babe, I need a driveway. I need to get out of here. I need to get out of here. I'm driving away. Where are you? Babe, this is crazy. No, I'm okay. I'll be fine. I'm okay. I'm okay. I just had to drive. Oh shit, babe, I think I'm getting pulled over now. I think I'm getting pulled over. Pull over your vehicle. Oh my God, babe, hold on a sec. Baby, talk to me. Oh shit. License and registration, sir. Yeah, of course, officer. Of course. Ah, babe, this is the worst day of my life. I just got pulled over. Oh my God. Oh my God. Officer, police. Oh my God. I think a riot's breaking out. We got a crazy riot. People are going crazy. Get out of here! It's like a lotus matter. There's like a war going on or some shit.
rap1.wav Sometimes I just feel like quitting. I still might. Why do I put up this fight? Why do I still write? Sometimes it's hard enough just dealing with real life. Sometimes I wanna jump a state and just kill mics. And so these people want my level of skills, like, but I'm still white. Sometimes I just hate life. Something ain't right. Hit the brake lights. Taste of the state's right. Drawing the blank line. It ain't my fault.
codeswitch.wav I'm alone, all by myself. Je suis tout seul. Sono tutto. Estoy solo.
ar1.wav إطلالات مكياج عيون ذهبي لسهرات صيف عشرين واحد وعشرين بأسلوب النجمات.
de.wav Raptorium Bergbau scheint profitierter als Monroe als Reaktion auf die wirtschaftlichen Ausfälle zu sein.
es1.wav Esta prenda es amplia, recomiendo elegir una talla menor a la habitual.
fr1.wav Alice et moi sommes allés à Paris voyager en train au printemps, c'était très amusant.
ru1.wav Барсук, живущий в киевском зоопарке, совершил побег из своего вольера.
ja1.wav 抜群の運動神経を持ち合わせ、どんな要求にも応えてきた。

Hint

The test wave files are from

raokouling.wav (a tongue twister, in Chinese)

To decode the test file ./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav, run:

./build/bin/sherpa-onnx-offline \
  --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx \
  --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx \
  --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx \
  --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer \
  --qwen3-asr-max-new-tokens=512 \
  --num-threads=2 \
  ./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-offline --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer --qwen3-asr-max-new-tokens=512 --num-threads=2 ./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx", encoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx", decoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx", 
tokenizer="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer", hotwords="", max_total_len=512, max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
recognizer created in 1.673 s
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:135 Creating a resampler:
   in_sample_rate: 44100
   output_sample_rate: 16000

Done!

./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.088 s
Real time factor (RTF): 3.088 / 20.760 = 0.149
{"lang": "", "emotion": "", "event": "", "text": "language Chinese<asr_text>广西壮族自治区爱吃红鲤鱼、绿鲤鱼与驴的出租车司机,拉着苗族土家族自治州爱喝自制的刘奶奶榴莲牛奶的古质叔,重症患者遇见别着喇叭的哑巴,打败姚子山前四十四棵死色柿子树的四十四只石狮子之后,碰到年年练流娘的牛郎,念着灰飞灰化肥发黑灰发走出山岗,官方网站设置组到广西壮族自治区首府南宁市民族医院就医。", "timestamps": [], "durations": [], "tokens":["language", " Chinese", "<asr_text>", "广西", "壮", "族自治", "区", "爱吃", "红", "鲤", "鱼", "、", "绿", "鲤", "鱼", "与", "驴", "的", "出租车", "司机", ",", "拉着", "苗", "族", "土", "家族", "自治", "州", "爱", "喝", "自制", "的", "刘", "奶奶", "榴", "莲", "牛奶", "的", "古", "质", "叔", ",", "重症", "患者", "遇见", "别", "着", "喇叭", "的", "哑", "巴", ",", "打败", "姚", "子", "山", "前", "四", "十四", "棵", "死", "色", "柿", "子", "树", "的", "四十", "四", "只", "石", "狮子", "之后", ",", "碰到", "年", "年", "练", "流", "娘", "的", "牛", "郎", ",", "念", "着", "灰", "飞", "灰", "化肥", "发", "黑", "灰", "发", "走出", "山", "岗", ",", "官方", "网站", "设置", "组", "到", "广西", "壮", "族自治", "区", "首", "府", "南宁", "市", "民族", "医院", "就医", "。"], "ys_log_probs": [], "words": []}
Ground truth for raokouling.wav:

广西壮族自治区爱吃红鲤鱼与绿鲤鱼与驴的出租车司机,拉着苗族土家族自治州爱喝自制的刘奶奶榴莲牛奶的骨质疏松症患者,遇见别着喇叭的哑巴,打败咬死山前四十四棵紫色柿子树的四十四只石狮子之后,碰到年年恋牛娘的牛郎,念着灰黑灰化肥发黑会挥发,走出香港官方网站设置组,到广西壮族自治区首府南宁市民族医院就医。
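The real time factor (RTF) reported in the log is simply the decoding time divided by the audio duration; as a quick sanity check, the same arithmetic can be reproduced with awk:

```shell
# RTF = elapsed decoding time / audio duration, values taken from the log above.
# An RTF below 1.0 means decoding is faster than real time.
awk 'BEGIN { printf "%.3f\n", 3.088 / 20.760 }'   # prints 0.149
```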
To use hotwords, pass them via --qwen3-asr-hotwords. Note that multiple hotwords are separated by a comma (,):

./build/bin/sherpa-onnx-offline \
  --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx \
  --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx \
  --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx \
  --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer \
  --qwen3-asr-max-new-tokens=512 \
  --num-threads=2 \
  --qwen3-asr-hotwords="骨质疏松症患者,咬死山前,紫色柿子树,年年恋牛娘,灰黑灰化肥,走出香港" \
  ./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-offline --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer --qwen3-asr-max-new-tokens=512 --num-threads=2 '--qwen3-asr-hotwords=骨质疏松症患者,咬死山前,紫色柿子树,年年恋牛娘,灰黑灰化肥,走出香港' ./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx", encoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx", decoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx", 
tokenizer="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer", hotwords="骨质疏松症患者,咬死山前,紫色柿子树,年年恋牛娘,灰黑灰化肥,走出香港", max_total_len=512, max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
recognizer created in 1.271 s
Started
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/offline-stream.cc:AcceptWaveformImpl:135 Creating a resampler:
   in_sample_rate: 44100
   output_sample_rate: 16000

Done!

./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/test_wavs/raokouling.wav
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 3.243 s
Real time factor (RTF): 3.243 / 20.760 = 0.156
{"lang": "", "emotion": "", "event": "", "text": "广西壮族自治区爱吃红鲤鱼、绿鲤鱼与驴的出租车司机,拉着苗族土家族自制粥,爱喝自制的刘奶奶榴莲牛奶的骨质疏松症患者,遇见别着喇叭的哑巴,打败摇死山前四十四棵死色柿子树的四十四只石狮子之后,碰到年年恋牛娘的牛郎,念着灰黑灰化肥发黑会挥发走出香港,官方网站设置组到广西壮族自治区首府南宁市民族医院就医。", "timestamps": [], "durations": [], "tokens":["广西", "壮", "族", "自治区", "爱吃", "红", "鲤", "鱼", "、", "绿", "鲤", "鱼", "与", "驴", "的", "出租车", "司机", ",", "拉着", "苗", "族", "土", "家族", "自制", "粥", ",", "爱", "喝", "自制", "的", "刘", "奶奶", "榴", "莲", "牛奶", "的", "骨", "质", "疏", "松", "症", "患者", ",", "遇见", "别", "着", "喇叭", "的", "哑", "巴", ",", "打败", "摇", "死", "山", "前", "四", "十四", "棵", "死", "色", "柿", "子", "树", "的", "四", "十四", "只", "石", "狮子", "之后", ",", "碰到", "年", "年", "恋", "牛", "娘", "的", "牛", "郎", ",", "念", "着", "灰", "黑", "灰", "化肥", "发", "黑", "会", "挥发", "走出", "香港", ",", "官方网站", "设置", "组", "到", "广西", "壮", "族", "自治区", "首", "府", "南宁", "市", "民族", "医院", "就医", "。"], "ys_log_probs": [], "words": []}
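The comma-separated hotwords string can also be assembled from a file instead of being typed inline. The snippet below is a small hypothetical helper (hotwords.txt is not part of the model tarball); it joins one phrase per line into the form that --qwen3-asr-hotwords expects:

```shell
# Hypothetical helper: hotwords.txt holds one hotword phrase per line.
printf '骨质疏松症患者\n咬死山前\n紫色柿子树\n' > hotwords.txt

# paste -s joins all lines into one, -d, uses a comma as the separator.
hotwords=$(paste -s -d, hotwords.txt)
echo "$hotwords"   # prints 骨质疏松症患者,咬死山前,紫色柿子树
```

The resulting value can then be passed as --qwen3-asr-hotwords="$hotwords".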

Decode a long audio file with VAD (Example 1/2, English)

The following example shows how to decode a very long audio file with the help of VAD (voice activity detection).

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/Obama.wav

./build/bin/sherpa-onnx-vad-with-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --silero-vad-threshold=0.2 \
  --silero-vad-min-speech-duration=0.2 \
  --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx \
  --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx \
  --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx \
  --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer \
  --qwen3-asr-max-new-tokens=512 \
  --num-threads=2 \
  ./Obama.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer --qwen3-asr-max-new-tokens=512 --num-threads=2 ./Obama.wav 

VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx", encoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx", decoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx", 
tokenizer="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer", hotwords="", max_total_len=512, max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./Obama.wav
Started!
8.976 -- 12.140: Thank you, everybody. All right, everybody, go ahead and have a seat.
13.104 -- 14.540: How's everybody doing today?
18.704 -- 22.892: How about Tim Spicer?
25.936 -- 31.884: I am here with students at Wakefield High School in Arlington, Virginia.
32.720 -- 48.844: And we've got students tuning in from all across America, from kindergarten through twelfth grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause.
54.416 -- 55.436: I know that.
56.240 -- 58.892: For many of you, today is the first day of school.
59.600 -- 69.452: And for those of you in kindergarten or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous.
70.640 -- 76.332: I imagine there is some seniors out there who are feeling pretty good right now. With just one more year to go.
78.800 -- 87.180: And no matter what grade you're in, some of you are probably wishing it were still summer, and you could have stayed in bed just a little bit longer this morning.
87.984 -- 89.100: I know that feel.
91.664 -- 111.708: When I was young, my family lived overseas. I lived in Indonesia for a few years, and my mother, she didn't have the money to send me where all the American kids went to school, but she thought it was important for me to keep up with an American education. So she decided to teach me extra lessons herself.
112.240 -- 118.700: Monday through Friday, but because she had to go to work, the only time she could do it was at four thirty in the morning.
120.048 -- 127.244: Now, as you might imagine, I wasn't too happy about getting up that early. A lot of times, I'd fall asleep right there at the kitchen table.
128.272 -- 135.340: But whenever I'd complain, my mother would just give me one of those looks, and she'd say, "This is no picnic for me either, Buster."
137.104 -- 145.132: So I know that some of you are still adjusting to being back at school, but I'm here today because I have something important to discuss with you.
145.808 -- 153.740: I'm here because I want to talk with you about your education, and what's expected of all of you in this new school year.
154.448 -- 160.268: I've given a lot of speeches about education, and I've talked about responsibility a lot.
160.816 -- 178.220: I've talked about teachers' responsibility for inspiring students and pushing you to learn. I've talked about your parents' responsibility for making sure you stay on track and you get your homework done and don't spend every waking hour in front of the TV or with the Xbox.
179.088 -- 180.716: I've talked a lot about.
181.360 -- 193.452: Your government's responsibility for setting high standards, in supporting teachers and principals, and turning around schools that aren't working, where students aren't getting the opportunities that they deserve.
194.000 -- 195.276: But at the end of the day.
196.016 -- 206.156: We can have the most dedicated teachers, the most supportive parents, the best schools in the world, and none of it will make a difference. None of it will matter.
206.704 -- 210.604: Unless all of you fulfill your responsibilities.
211.248 -- 223.404: Unless you show up to those schools, unless you pay attention to those teachers, unless you listen to your parents and grandparents and other adults, and put in the hard work it takes to succeed.
224.656 -- 230.924: That's what I want to focus on today: the responsibility each of you has for your education.
231.728 -- 234.796: I want to start with the responsibility you have to yourself.
235.696 -- 238.988: Every single one of you has something that you're good at.
239.760 -- 242.412: Every single one of you has something to offer.
242.992 -- 247.404: And you have a responsibility to yourself to discover what that is.
248.336 -- 251.564: That's the opportunity an education can provide.
252.336 -- 265.900: Maybe you could be a great writer, maybe even good enough to write a book or articles in a newspaper, but you might not know it until you write that English paper, that English class paper that's assigned to you.
266.704 -- 278.668: Maybe you could be an innovator, an inventor. Maybe even good enough to come up with the next iPhone or the new medicine or vaccine. But you might not know it until you do your project for your science class.
279.824 -- 289.964: Maybe you could be a mayor, or a senator, or a Supreme Court justice, but you might not know that until you join student government or the debate team.
291.568 -- 309.516: And no matter what you want to do with your life, I guarantee that you'll need an education to do it. You want to be a doctor, or a teacher, or a police officer, you want to be a nurse, or an architect, a lawyer, or a member of our military—you are going to need a good education for every single one of those careers.
310.064 -- 314.348: You cannot drop out of school and just drop into a good job.
315.184 -- 319.852: You've got to train for it, and work for it, and learn for it.
320.528 -- 323.628: And this isn't just important for your own life and your own future.
324.688 -- 332.812: What you make of your education will decide nothing less than the future of this country. The future of America depends on you.

num threads: 2
decoding method: greedy_search
Elapsed seconds: 34.480 s
Real time factor (RTF): 34.480 / 334.234 = 0.103
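Each line of the VAD output above carries a start time and an end time in seconds. If the output is saved to a file, a short awk one-liner can sum the detected speech duration; segments.txt below is a hypothetical file seeded with two sample lines in the same "start -- end: text" format:

```shell
# Two sample segment lines in the "start -- end: text" format printed above.
printf '8.976 -- 12.140: Thank you, everybody.\n13.104 -- 14.540: How is everybody doing today?\n' > segments.txt

# Sum (end - start) over all segment lines. awk coerces "12.140:" to 12.140
# when doing arithmetic, so the trailing colon on field 3 is harmless.
awk '/ -- / { sum += $3 - $1 } END { printf "%.3f\n", sum }' segments.txt   # prints 4.600
```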

Decode a long audio file with VAD (Example 2/2, Chinese)

The following example shows how to decode a very long Chinese audio file with the help of VAD.

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/lei-jun-test.wav

./build/bin/sherpa-onnx-vad-with-offline-asr \
  --silero-vad-model=./silero_vad.onnx \
  --silero-vad-threshold=0.2 \
  --silero-vad-min-speech-duration=0.2 \
  --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx \
  --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx \
  --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx \
  --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer \
  --qwen3-asr-max-new-tokens=512 \
  --num-threads=2 \
  ./lei-jun-test.wav

You should see the following output:

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:373 ./build/bin/sherpa-onnx-vad-with-offline-asr --silero-vad-model=./silero_vad.onnx --silero-vad-threshold=0.2 --silero-vad-min-speech-duration=0.2 --qwen3-asr-conv-frontend=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx --qwen3-asr-encoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx --qwen3-asr-decoder=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx --qwen3-asr-tokenizer=./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer --qwen3-asr-max-new-tokens=512 --num-threads=2 ./lei-jun-test.wav 

VadModelConfig(silero_vad=SileroVadModelConfig(model="./silero_vad.onnx", threshold=0.2, min_silence_duration=0.5, min_speech_duration=0.2, max_speech_duration=20, window_size=512, neg_threshold=-1), ten_vad=TenVadModelConfig(model="", threshold=0.5, min_silence_duration=0.5, min_speech_duration=0.25, max_speech_duration=20, window_size=256), sample_rate=16000, num_threads=1, provider="cpu", debug=False)
OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0, normalize_samples=True, snip_edges=False), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1, enable_token_timestamps=False, enable_segment_timestamps=False), fire_red_asr=OfflineFireRedAsrModelConfig(encoder="", decoder=""), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model=""), wenet_ctc=OfflineWenetCtcModelConfig(model=""), sense_voice=OfflineSenseVoiceModelConfig(model="", language="auto", use_itn=False), moonshine=OfflineMoonshineModelConfig(preprocessor="", encoder="", uncached_decoder="", cached_decoder="", merged_decoder=""), dolphin=OfflineDolphinModelConfig(model=""), canary=OfflineCanaryModelConfig(encoder="", decoder="", src_lang="", tgt_lang="", use_pnc=True), cohere_transcribe=OfflineCohereTranscribeModelConfig(encoder="", decoder="", language="", use_punct=True, use_itn=True), omnilingual=OfflineOmnilingualAsrCtcModelConfig(model=""), funasr_nano=OfflineFunASRNanoModelConfig(encoder_adaptor="", llm="", embedding="", tokenizer="", system_prompt="You are a helpful assistant.", user_prompt="语音转写:", max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42, language="", itn=True, hotwords=""), medasr=OfflineMedAsrCtcModelConfig(model=""), fire_red_asr_ctc=OfflineFireRedAsrCtcModelConfig(model=""), qwen3_asr=OfflineQwen3ASRModelConfig(conv_frontend="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/conv_frontend.onnx", encoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/encoder.int8.onnx", decoder="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/decoder.int8.onnx", 
tokenizer="./sherpa-onnx-qwen3-asr-0.6B-int8-2026-03-25/tokenizer", hotwords="", max_total_len=512, max_new_tokens=512, temperature=1e-06, top_p=0.8, seed=42), telespeech_ctc="", tokens="", num_threads=2, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5, lodr_scale=0.01, lodr_fst="", lodr_backoff_id=-1), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0, rule_fsts="", rule_fars="", hr=HomophoneReplacerConfig(lexicon="", rule_fsts=""))
Creating recognizer ...
Recognizer created!
Started
Reading: ./lei-jun-test.wav
Started!

29.776 -- 35.788: 晚上好,欢迎大家来参加今天晚上的活动。谢谢大家。
42.160 -- 45.996: 这是我第四次办年度演讲。
47.024 -- 49.868: 前三次呢,因为疫情的原因。
50.512 -- 55.340: 都在小米科技园内举办,现场的人很少。
56.176 -- 57.388: 这是第四次。
58.192 -- 66.892: 我们仔细想了想,我们还是想办一个比较大的聚会,然后呢,让我们的新朋友、老朋友一起聚一聚。
67.760 -- 70.828: 今天的话呢,我们就在北京的。
71.664 -- 74.828: 国家会议中心呢,举办了这么一个活动。
75.472 -- 85.868: 现场呢,来了很多人,大概有三千五百人,还有很多很多的朋友呢,通过观看直播的方式来参与。
86.352 -- 91.308: 再一次呢,对大家的参加表示感谢,谢谢大家。
98.512 -- 99.692: 两个月前。
100.400 -- 104.396: 我参加了今年武汉大学的毕业典礼。
105.936 -- 107.276: 今年呢是。
107.888 -- 110.572: 武汉大学建校一百三十周年。
111.760 -- 117.196: 作为校友,被母校邀请,在毕业典礼上致辞。
118.032 -- 122.732: 这对我来说是至高无上的荣誉。
123.664 -- 128.556: 站在讲台的那一刻,面对全校师生。
129.200 -- 134.252: 关于武大的所有的记忆,一下子涌现在脑海里。
134.960 -- 139.436: 今天呢,我就先和大家聊聊五大往事。
141.840 -- 143.980: 那还是三十六年前。
145.936 -- 147.660: 一九八七年。
148.688 -- 151.564: 我呢,考上了武汉大学的计算机系。
152.688 -- 156.748: 在武汉大学的图书馆里,看了一本书。
157.584 -- 161.804: 硅谷之火,建立了我一生的梦想。
163.312 -- 164.652: 看完书以后。
165.264 -- 166.636: 热血沸腾。
167.600 -- 169.548: 激动得睡不着觉。
170.416 -- 171.404: 我还记得。
172.016 -- 174.700: 那天晚上,星光很亮。
175.408 -- 179.820: 我就在武大的操场上,就是屏幕上这个操场。
180.816 -- 185.228: 走了一圈又一圈,走了整整一个晚上。
186.480 -- 187.916: 我心里有团火。
188.912 -- 192.076: 我也想搬一个伟大的公司。
193.968 -- 195.020: 就是这样。
197.648 -- 202.316: 梦想之火,在我心里彻底点燃了。
209.968 -- 212.396: 是一个大一的新生。
220.496 -- 222.636: 是一个大一的新生。
223.984 -- 226.892: 一个从县城里出来的年轻人。
228.368 -- 230.604: 什么也不会,什么也没有。
231.568 -- 236.204: 就想创办一家伟大的公司,这不就是天方夜谭吗?
237.616 -- 239.788: 这么离谱的一个梦想。
240.400 -- 242.316: 该如何实现呢?
243.856 -- 246.924: 那天晚上,我想了一整晚上。
247.952 -- 249.068: 说实话。
250.352 -- 253.868: 越想越糊涂,完全理不清头绪。
254.960 -- 265.836: 后来我在想:“哎,干脆别想了,把书念好是正事。”所以呢,我就下定决心,认认真真读书。
266.640 -- 267.468: 那么。
268.496 -- 271.564: 我怎么能够把书读得不同凡响呢?

num threads: 2
decoding method: greedy_search
Elapsed seconds: 20.968 s
Real time factor (RTF): 20.968 / 272.448 = 0.077