Hotwords (Contextual biasing)

In this section, we describe how we implement the hotwords (aka contextual biasing) feature with an Aho-Corasick automaton and how to use it in sherpa-onnx.

Caution

Only transducer models support hotwords in sherpa-onnx. That is, only models from Offline transducer models and Online transducer models support hotwords.

All other models don’t support hotwords.

Also, you have to change the decoding method to modified_beam_search to use hotwords. The default decoding method greedy_search does not support hotwords.

What are hotwords

Current ASR systems work very well for general cases, but they sometimes fail to recognize special words/phrases (aka hotwords) like rare words, personalized information, etc. Usually, those words/phrases are recognized as words/phrases that sound similar to them (for example, LOUIS FOURTEEN is recognized as LEWIS FOURTEEN). So we have to provide some context information (for example, the phrase LOUIS FOURTEEN) to the ASR system to boost those words/phrases. Normally, we call this kind of boosting task contextual biasing (aka hotwords recognition).

How do we implement it with an Aho-Corasick automaton

We first construct an Aho-Corasick automaton on the given hotwords (after tokenizing them into tokens). Please refer to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm for the construction details of Aho-Corasick.

The figure below is the Aho-Corasick automaton built on “HE/SHE/SHELL/HIS/THIS” with hotwords-score==1.


Fig. 124 The Aho-Corasick automaton for “HE/SHE/SHELL/HIS/THIS”

The black arrows in the graph are the goto arcs, the red arrows are the failure arcs, and the green arrows are the output arcs. Each goto arc carries a token and a boosting score (note: we boost the path whenever a partial sequence matches; if the path eventually fails to fully match any hotword, the boosted score is cancelled). Currently, the boosting score is distributed evenly over the arcs along the path. Each state has two scores: the first is the node score (mainly used to cancel scores), and the second is the output score, which is the total score of the hotwords fully matched at this state.
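To make this concrete, below is a minimal Python sketch (illustrative only, not the actual sherpa-onnx C++ implementation; all names are ours) that builds the automaton of Fig. 124: a token trie whose goto arcs each carry hotwords-score, with failure arcs and output scores filled in by a BFS.

from collections import deque

class Node:
    def __init__(self):
        self.next = {}           # goto arcs: token -> node id
        self.fail = 0            # failure arc (node id); 0 is the root
        self.node_score = 0.0    # score accumulated on the path from the root
        self.output_score = 0.0  # total score of hotwords fully matched here

def build(hotwords, hotwords_score=1.0):
    """hotwords is a list of token sequences, e.g. [['H', 'E'], ...]."""
    nodes = [Node()]  # nodes[0] is the root
    # 1) Build the trie; every goto arc contributes hotwords_score.
    for tokens in hotwords:
        cur = 0
        for tok in tokens:
            if tok not in nodes[cur].next:
                nodes.append(Node())
                nodes[-1].node_score = nodes[cur].node_score + hotwords_score
                nodes[cur].next[tok] = len(nodes) - 1
            cur = nodes[cur].next[tok]
        # A hotword ends here; its full score equals the node score.
        nodes[cur].output_score += nodes[cur].node_score
    # 2) Compute failure arcs by BFS; output scores accumulate along them.
    queue = deque(nodes[0].next.values())
    while queue:
        u = queue.popleft()
        for tok, v in nodes[u].next.items():
            f = nodes[u].fail
            while tok not in nodes[f].next and f != 0:
                f = nodes[f].fail
            nodes[v].fail = nodes[f].next.get(tok, 0)
            nodes[v].output_score += nodes[nodes[v].fail].output_score
            queue.append(v)
    return nodes

# With this insertion order, the state ids happen to match the figure above.
graph = build([list(w) for w in ["HE", "SHE", "SHELL", "HIS", "THIS"]])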

The following are several matching examples of the graph above.

Note

For simplicity, we assume that the system emits a token each frame.

Hint

We have an extra finalize step to force the graph state to go back to the root state.

The path is “SHELF”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 3           |
2        | 1           | 2                 | 4           |
3        | 1 + 5       | 8                 | 5           | HE, SHE
4        | 1           | 9                 | 6           |
5        | -4          | 5                 | 0           |
finalize | 0           | 5                 | 0           |

At frame 3 we reach state 5 and match HE and SHE, so we get a boosting score of 1 + 5; the score 1 here is because SHEL might still be the prefix of other hotwords. At frame 5, F cannot match any token, so we fail back to the root and cancel the score for SHEL, which is 4 (the node score of state 6).

The path is “HI”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 1           |
2        | 1           | 2                 | 8           |
finalize | -2          | 0                 | 0           |

Both H and I match tokens in the graph; unfortunately, we have to go back to the root state when we finish matching a path, so we cancel the boosting score of HI, which is 2 (the node score of state 8).

The path is “THE”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 10          |
2        | 1           | 2                 | 11          |
3        | 0 + 2       | 4                 | 2           | HE
finalize | -2          | 2                 | 0           |

At frame 3 we jump from state 11 to state 2 and get a boosting score of 0 + 2: 0 because the node score of state 2 is the same as that of state 11, so we gain nothing from the partial match (the prefix for state 11, TH, has the same length as the prefix for state 2, HE), but we do get the output score (state 2 outputs HE).
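Building on the construction sketch above (reusing build and graph from that block), the per-frame logic of these tables can be written as a step function whose score delta is the node-score difference plus the output score of the new state, together with a finalize step that cancels any pending partial match. Again, this is an illustrative sketch rather than the sherpa-onnx internals.

def step(nodes, state, token):
    """Consume one token; return (boost_delta, new_state)."""
    s = state
    # Follow failure arcs until the token matches or we reach the root.
    while token not in nodes[s].next and s != 0:
        s = nodes[s].fail
    new = nodes[s].next.get(token, 0)  # total miss: fall back to the root
    # The node-score difference boosts a (partial) match and cancels a failed
    # one; the output score rewards hotwords fully matched at the new state.
    delta = nodes[new].node_score - nodes[state].node_score + nodes[new].output_score
    return delta, new

def finalize(nodes, state):
    """Force the state back to the root, cancelling partial-match boosts."""
    return -nodes[state].node_score, 0

# Reproduce the "SHELF" table above.
state, total = 0, 0.0
for tok in "SHELF":
    delta, state = step(graph, state, tok)
    total += delta
    print(f"token={tok} boost={delta:+g} total={total:g} state={state}")
delta, state = finalize(graph, state)
print(f"finalize boost={delta:+g} total={total + delta:g} state={state}")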

Note

We implement the hotwords feature at inference time; you don’t have to re-train the models to use this feature.

How to use hotwords in sherpa-onnx

Caution

Currently, the hotwords feature is only supported in the modified_beam_search decoding method of transducer models (both streaming and non-streaming).

Using hotwords is the same for streaming and non-streaming models; in fact, it is the same across all the APIs supported by sherpa-onnx. We add four extra arguments for hotwords:

  • hotwords-file

    The file path of the hotwords file, one hotword per line. They can be Chinese words, English words, or both, according to the modeling units used to train the model. Here are some examples:

    For Chinese models trained on cjkchar it looks like:

    语音识别
    深度学习
    

    For English-like language models trained on bpe it looks like:

    SPEECH RECOGNITION
    DEEP LEARNING
    

    For multilingual models trained on cjkchar+bpe (Chinese + English) it looks like:

    SPEECH 识别
    SPEECH RECOGNITION
    深度学习
    

    You can also specify the boosting score for each hotword; the score must follow the predefined character :, for example:

    语音识别 :3.5
    深度学习 :2.0
    

    It means hotword 语音识别 will have a boosting score of 3.5 and hotword 深度学习 will have a boosting score of 2.0. Hotwords that don’t have a specific score use the global score provided by hotwords-score below (see the parsing sketch after this list).

    Caution

    The specific score MUST BE the last item on each hotword line (i.e., you shouldn’t break the hotword into two parts with the score). For example, SPEECH :2.0 识别 is invalid.

  • hotwords-score

    The boosting score for each matched token.

    Note

    We match the hotwords at token level, so the hotwords-score is applied at token level.

  • modeling-unit

    The modeling unit of the given model; currently we support cjkchar (for Chinese), bpe (for English-like languages) and cjkchar+bpe (for multilingual models). We need this modeling-unit to select the tokenizer that encodes words/phrases into tokens, so do provide the correct modeling-unit for your model.

  • bpe-vocab

    The bpe vocabulary generated by the sentencepiece toolkit; it can also be exported from bpe.model (see script/export_bpe_vocab.py for details). This vocabulary is used to tokenize words/phrases into bpe units. It is only used when modeling-unit is bpe or cjkchar+bpe.

    Hint

    We need bpe.vocab rather than bpe.model because we don’t introduce the sentencepiece C++ codebase into sherpa-onnx (it has a dependency issue with protobuf); instead, we implement a simple sentencepiece encoder and decoder that takes bpe.vocab as input.
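To make the file format above concrete, here is a hedged sketch of how a hotwords line with an optional trailing :score can be parsed (sherpa-onnx does this internally in C++; the function name here is ours, and the default of 1.5 mirrors the hotwords_score default visible in the logs below):

def parse_hotword_line(line, default_score=1.5):
    """Split 'PHRASE [:score]' into (phrase, boosting_score)."""
    line = line.strip()
    phrase, score = line, default_score
    parts = line.rsplit(maxsplit=1)
    # The per-hotword score, if present, is the last item and starts with ':'.
    if len(parts) == 2 and parts[1].startswith(":"):
        phrase, score = parts[0], float(parts[1][1:])
    return phrase, score

assert parse_hotword_line("语音识别 :3.5") == ("语音识别", 3.5)
assert parse_hotword_line("SPEECH RECOGNITION") == ("SPEECH RECOGNITION", 1.5)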

The main difference in using the hotwords feature lies in the modeling units. The following shows how to use it for different modeling units.

Hint

You can use any transducer model from https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html; we just picked three of them at random for the following examples.

Note

In the following examples, we mostly use non-streaming models; if you are using a streaming model, you should use sherpa-onnx instead of sherpa-onnx-offline. sherpa-onnx-alsa, sherpa-onnx-microphone, sherpa-onnx-microphone-offline, sherpa-onnx-online-websocket-server and sherpa-onnx-offline-websocket-server all support hotwords as well.
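The examples below drive sherpa-onnx through its command-line tools and the bundled example scripts, but the same four arguments are also exposed by the Python API. A minimal sketch; hotwords_file and hotwords_score follow recent sherpa-onnx releases, while modeling_unit and bpe_vocab depend on your version, so check help(sherpa_onnx.OfflineRecognizer.from_transducer):

import sherpa_onnx
import soundfile as sf  # pip install soundfile; only used to read the wave

# Hotwords only take effect with modified_beam_search.
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="exp/encoder-epoch-99-avg-1.onnx",
    decoder="exp/decoder-epoch-99-avg-1.onnx",
    joiner="exp/joiner-epoch-99-avg-1.onnx",
    tokens="exp/tokens.txt",
    decoding_method="modified_beam_search",
    hotwords_file="hotwords_en.txt",
    hotwords_score=2.0,
)

samples, sample_rate = sf.read("exp/test_wavs/0.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)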

Modeling unit is bpe

Download the model

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
tar xvf sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
rm sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
ln -s sherpa-onnx-zipformer-en-2023-04-01 exp

Export bpe.vocab if you can’t find bpe.vocab in the model directory.

python script/export_bpe_vocab.py --bpe-model exp/bpe.model
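For reference, script/export_bpe_vocab.py essentially dumps the sentencepiece model as text. A hedged sketch of doing the same by hand with the sentencepiece Python package, assuming the standard piece/score vocab format (one tab-separated pair per line):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="exp/bpe.model")
with open("exp/bpe.vocab", "w", encoding="utf-8") as f:
    for i in range(sp.vocab_size()):
        # Each line: <piece>\t<score> (the score is a log probability)
        f.write(f"{sp.id_to_piece(i)}\t{sp.get_score(i)}\n")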

The hotwords_en.txt contains:

QUARTERS
FOREVER

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx-offline \
  --encoder=exp/encoder-epoch-99-avg-1.onnx \
  --decoder=exp/decoder-epoch-99-avg-1.onnx \
  --joiner=exp/joiner-epoch-99-avg-1.onnx \
  --decoding-method=modified_beam_search \
  --tokens=exp/tokens.txt \
  exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp/encoder-epoch-99-avg-1.onnx --decoder=exp/decoder-epoch-99-avg-1.onnx --joiner=exp/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp/tokens.txt exp/test_wavs/0.wav exp/test_wavs/1.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=, hotwords_score=1.5)
Creating recognizer ...
Started
Done!

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.32, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]","tokens":["A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[2.44, 2.64, 2.88, 3.16, 3.28, 3.48, 3.60, 3.80, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.44, 5.68, 6.04, 6.24, 6.48, 6.84, 7.08, 7.32, 7.56, 7.84, 8.12, 8.24, 8.32, 8.44, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.56, 9.64, 9.76, 9.96, 10.04, 10.20, 10.40, 10.64, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, 13.96, 14.12, 14.24, 14.36, 14.52, 14.68, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.16, 16.24, 16.32]","tokens":["IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.775 s
Real time factor (RTF): 1.775 / 23.340 = 0.076

Decoding with hotwords

./build/bin/sherpa-onnx-offline \
    --encoder=exp/encoder-epoch-99-avg-1.onnx \
    --decoder=exp/decoder-epoch-99-avg-1.onnx \
    --joiner=exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp/tokens.txt \
    --modeling-unit=bpe \
    --bpe-vocab=exp/bpe.vocab \
    --hotwords-file=hotwords_en.txt \
    --hotwords-score=2.0 \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp/encoder-epoch-99-avg-1.onnx --decoder=exp/decoder-epoch-99-avg-1.onnx --joiner=exp/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp/tokens.txt --hotwords-file=hotwords_en.txt --hotwords-score=2.0 exp/test_wavs/0.wav exp/test_wavs/1.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=hotwords_en.txt, hotwords_score=2)
Creating recognizer ...
Started
Done!

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.12, 5.36, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]","tokens":["A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER","S"," OF"," THE"," B","RO","TH","EL","S"]}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[2.44, 2.64, 2.88, 3.16, 3.28, 3.48, 3.60, 3.80, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.44, 5.68, 6.04, 6.24, 6.48, 6.84, 7.08, 7.32, 7.56, 7.84, 8.12, 8.24, 8.32, 8.44, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.56, 9.64, 9.76, 9.96, 10.04, 10.20, 10.40, 10.68, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, 13.96, 14.12, 14.24, 14.36, 14.52, 14.68, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.16, 16.24, 16.32]","tokens":["IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.522 s
Real time factor (RTF): 1.522 / 23.340 = 0.065

Hint

QUARTER -> QUARTERS

FOR EVER -> FOREVER

Python API

Decoding without hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp/encoder-epoch-99-avg-1.onnx \
    --decoder exp/decoder-epoch-99-avg-1.onnx \
    --joiner exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp/tokens.txt \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.546 s
Real time factor (RTF): 2.546/23.340 = 0.109

Decoding with hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp/encoder-epoch-99-avg-1.onnx \
    --decoder exp/decoder-epoch-99-avg-1.onnx \
    --joiner exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp/tokens.txt \
    --modeling-unit bpe \
    --bpe-vocab exp/bpe.vocab \
    --hotwords-file hotwords_en.txt \
    --hotwords-score 2.0 \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.463 s
Real time factor (RTF): 2.463/23.340 = 0.106

Hint

QUARTER -> QUARTERS

FOR EVER -> FOREVER

Modeling unit is cjkchar

Download the model

cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2
tar xvf sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2
rm sherpa-onnx-conformer-zh-stateless2-2023-05-23.tar.bz2

ln -s sherpa-onnx-conformer-zh-stateless2-2023-05-23 exp-zh

The hotwords_cn.txt contains:

文森特卡索
周望君
朱丽楠
蒋有伯

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx-offline \
    --encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens=exp-zh/tokens.txt \
    --decoding-method=modified_beam_search \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp-zh/encoder-epoch-99-avg-1.onnx --decoder=exp-zh/decoder-epoch-99-avg-1.onnx --joiner=exp-zh/joiner-epoch-99-avg-1.onnx --tokens=exp-zh/tokens.txt --decoding-method=modified_beam_search exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp-zh/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=, hotwords_score=1.5)
Creating recognizer ...
Started
Done!

exp-zh/test_wavs/3.wav
{"text":"文森特卡所是全球知名的法国性格派演员","timestamps":"[0.00, 0.16, 0.68, 1.32, 1.72, 2.08, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":["文","森","特","卡","所","是","全","球","知","名","的","法","国","性","格","派","演","员"]}
----
exp-zh/test_wavs/4.wav
{"text":"蒋友伯被拍到带着女儿出游","timestamps":"[0.00, 0.20, 0.88, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":["蒋","友","伯","被","拍","到","带","着","女","儿","出","游"]}
----
exp-zh/test_wavs/5.wav
{"text":"周望军就落实控物价","timestamps":"[0.00, 0.16, 0.88, 1.24, 1.64, 1.96, 2.76, 3.04, 3.32]","tokens":["周","望","军","就","落","实","控","物","价"]}
----
exp-zh/test_wavs/6.wav
{"text":"朱立南在上市见面会上表示","timestamps":"[0.00, 0.16, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":["朱","立","南","在","上","市","见","面","会","上","表","示"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.883 s
Real time factor (RTF): 1.883 / 20.328 = 0.093

Decoding with hotwords

./build/bin/sherpa-onnx-offline \
  --encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
  --decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
  --joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
  --tokens=exp-zh/tokens.txt \
  --decoding-method=modified_beam_search \
  --modeling-unit=cjkchar \
  --hotwords-file=hotwords_cn.txt \
  --hotwords-score=2.0 \
  exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp-zh/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=hotwords_cn.txt, hotwords_score=2)
Creating recognizer ...
Started
Done!

exp-zh/test_wavs/3.wav
{"text":"文森特卡索是全球知名的法国性格派演员","timestamps":"[0.00, 0.16, 0.64, 1.28, 1.64, 2.04, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":["文","森","特","卡","索","是","全","球","知","名","的","法","国","性","格","派","演","员"]}
----
exp-zh/test_wavs/4.wav
{"text":"蒋有伯被拍到带着女儿出游","timestamps":"[0.00, 0.12, 0.80, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":["蒋","有","伯","被","拍","到","带","着","女","儿","出","游"]}
----
exp-zh/test_wavs/5.wav
{"text":"周望君就落实空物价","timestamps":"[0.00, 0.12, 0.80, 1.24, 1.56, 1.96, 2.68, 3.08, 3.32]","tokens":["周","望","君","就","落","实","空","物","价"]}
----
exp-zh/test_wavs/6.wav
{"text":"朱丽楠在上市见面会上表示","timestamps":"[0.00, 0.12, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":["朱","丽","楠","在","上","市","见","面","会","上","表","示"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.810 s
Real time factor (RTF): 1.810 / 20.328 = 0.089

Hint

文森特卡所 -> 文森特卡索

周望军 -> 周望君

朱立南 -> 朱丽楠

蒋友伯 -> 蒋有伯

Python API

Decoding without hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens exp-zh/tokens.txt \
    --decoding-method modified_beam_search \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

Started!
Done!
exp-zh/test_wavs/3.wav
文森特卡所是全球知名的法国性格派演员
----------
exp-zh/test_wavs/4.wav
蒋友伯被拍到带着女儿出游
----------
exp-zh/test_wavs/5.wav
周望军就落实控物价
----------
exp-zh/test_wavs/6.wav
朱立南在上市见面会上表示
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.653 s
Real time factor (RTF): 2.653/20.328 = 0.131

Decoding with hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens exp-zh/tokens.txt \
    --decoding-method modified_beam_search \
    --modeling-unit cjkchar \
    --hotwords-file hotwords_cn.txt \
    --hotwords-score 2.0 \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

Started!
Done!
exp-zh/test_wavs/3.wav
文森特卡索是全球知名的法国性格派演员
----------
exp-zh/test_wavs/4.wav
蒋有伯被拍到带着女儿出游
----------
exp-zh/test_wavs/5.wav
周望君就落实空物价
----------
exp-zh/test_wavs/6.wav
朱丽楠在上市见面会上表示
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.636 s
Real time factor (RTF): 2.636/20.328 = 0.130

Hint

文森特卡所 -> 文森特卡索

周望军 -> 周望君

朱立南 -> 朱丽楠

蒋友伯 -> 蒋有伯

Modeling unit is cjkchar+bpe

Download the model

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
rm sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2

ln -s sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 exp-mixed

Export bpe.vocab if you can’t find bpe.vocab in the model directory.

python script/export_bpe_vocab.py --bpe-model exp-mixed/bpe.model

The hotwords_mix.txt contains:

礼拜二
频繁

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx \
    --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp-mixed/tokens.txt \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp-mixed/tokens.txt exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), tokens="exp-mixed/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type=""), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=1.5, hotwords_file="", decoding_method="modified_beam_search")

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3, Real time factor (RTF): 0.3
昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三
{"is_final":false,"segment":0,"start_time":0.0,"text":"昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.80, 6.16, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":["昨","天","是"," MO","N","DAY"," TO","DAY"," IS"," LI","B","R"," THE"," DAY"," AFTER"," TO","M","OR","ROW","是","星","期","三"]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.7, Real time factor (RTF): 0.37
是不是平凡的啊不认识记下来 FREQUENTLY频繁的
{"is_final":false,"segment":0,"start_time":0.0,"text":"是不是平凡的啊不认识记下来 FREQUENTLY频繁的","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3.00, 3.12, 3.32, 3.64, 3.96, 4.36]","tokens":["是","不","是","平","凡","的","啊","不","认","识","记","下","来"," F","RE","QU","ENT","LY","频","繁","的"]}

Decoding with hotwords

./build/bin/sherpa-onnx \
    --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp-mixed/tokens.txt \
    --modeling-unit=cjkchar+bpe \
    --bpe-vocab=exp-mixed/bpe.vocab \
    --hotwords-file=hotwords_mix.txt \
    --hotwords-score=2.0 \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp-mixed/tokens.txt --tokens-type=cjkchar+bpe --bpe-model=exp-mixed/bpe.model --hotwords-file=hotwords_mix.txt --hotwords-score=2.0 exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), tokens="exp-mixed/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type=""), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=2, hotwords_file="hotwords_mix.txt", decoding_method="modified_beam_search")

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3.2, Real time factor (RTF): 0.32
昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三
{"is_final":false,"segment":0,"start_time":0.0,"text":"昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.68, 6.00, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":["昨","天","是"," MO","N","DAY"," TO","DAY"," IS","礼","拜","二"," THE"," DAY"," AFTER"," TO","M","OR","ROW","是","星","期","三"]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.9, Real time factor (RTF): 0.4
是不是频繁的啊不认识记下来 FREQUENTLY频繁的
{"is_final":false,"segment":0,"start_time":0.0,"text":"是不是频繁的啊不认识记下来 FREQUENTLY频繁的","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3.00, 3.12, 3.32, 3.64, 3.96, 4.36]","tokens":["是","不","是","频","繁","的","啊","不","认","识","记","下","来"," F","RE","QU","ENT","LY","频","繁","的"]}

Hint

LIBR -> 礼拜二

平凡 -> 频繁

Python API

Decoding without hotwords

python python-api-examples/online-decode-files.py \
    --encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp-mixed/tokens.txt \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

Started!
Done!
exp-mixed/test_wavs/0.wav
昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三
----------
exp-mixed/test_wavs/2.wav
是不是平凡的啊不认识记下来 FREQUENTLY频繁的
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.052 s
Real time factor (RTF): 3.052/14.743 = 0.207

Decoding with hotwords

python python-api-examples/online-decode-files.py \
    --encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp-mixed/tokens.txt \
    --modeling-unit cjkchar+bpe \
    --bpe-vocab exp-mixed/bpe.vocab \
    --hotwords-file hotwords_mix.txt \
    --hotwords-score 2.0 \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

Started!
Done!
exp-mixed/test_wavs/0.wav
昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三
----------
exp-mixed/test_wavs/2.wav
是不是频繁的啊不认识记下来 FREQUENTLY频繁的
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.060 s
Real time factor (RTF): 3.060/14.743 = 0.208

Hint

LIBR -> 礼拜二

平凡 -> 频繁