Hotwords (Contextual biasing)

In this section, we describe how we implement the hotwords (aka contextual biasing) feature with an Aho-Corasick automaton and how to use it in sherpa-onnx.

What are hotwords

Current ASR systems work very well for general cases, but they sometimes fail to recognize special words/phrases (aka hotwords) such as rare words, personalized information, etc. Usually, those words/phrases are recognized as words/phrases that sound similar to them (for example, LOUIS FOURTEEN is recognized as LEWIS FOURTEEN). So we have to provide some context information (for example, LOUIS FOURTEEN) to the ASR system to boost those words/phrases. This kind of boosting task is normally called contextual biasing (aka hotwords recognition).

How do we implement it with an Aho-Corasick automaton

We first construct an Aho-Corasick automaton on the given hotwords (after tokenizing them into tokens). Please refer to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm for the construction details of Aho-Corasick.

The figure below shows the Aho-Corasick automaton for “HE/SHE/SHELL/HIS/THIS” with hotwords-score=1.


Fig. 63 The Aho-Corasick automaton for “HE/SHE/SHELL/HIS/THIS”

The black arrows in the graph are the goto arcs, the red arrows are the failure arcs, and the green arrows are the output arcs. Each goto arc carries a token and a boosting score (note: we boost the path as soon as any partial sequence matches; if the path eventually fails to fully match any hotword, the boosted score is cancelled). Currently, the boosting score is distributed evenly across the arcs along the path. Each state carries two scores: the first is the node score (mainly used to cancel scores), and the second is the output score, which is the total score of the hotwords fully matched at this state.
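
To make this concrete, below is a minimal Python sketch of constructing such an automaton, where each goto arc carries a per-token score and each state stores a node score and an output score. This is only an illustration; the actual sherpa-onnx implementation lives in the C++ runtime.

from collections import deque

class State:
    def __init__(self):
        self.next = {}           # goto arcs: token -> State
        self.fail = None         # failure arc
        self.node_score = 0.0    # accumulated score along the matched prefix
        self.output_score = 0.0  # total score of hotwords fully matched here

def build(hotwords, hotword_score=1.0):
    # hotwords: list of token sequences, e.g. [["H", "E"], ["S", "H", "E"], ...]
    root = State()
    # 1. Build the goto trie; the boosting score is distributed evenly,
    #    i.e. each goto arc carries hotword_score.
    for tokens in hotwords:
        s = root
        for t in tokens:
            if t not in s.next:
                child = State()
                child.node_score = s.node_score + hotword_score
                s.next[t] = child
            s = s.next[t]
        s.output_score += s.node_score  # a full match outputs its whole score
    # 2. Add failure arcs by BFS (standard Aho-Corasick construction).
    root.fail = root
    queue = deque()
    for child in root.next.values():
        child.fail = root
        queue.append(child)
    while queue:
        s = queue.popleft()
        for t, child in s.next.items():
            f = s.fail
            while t not in f.next and f is not root:
                f = f.fail
            child.fail = f.next.get(t, root)
            # output arcs: a state also outputs whatever its fail state outputs
            child.output_score += child.fail.output_score
            queue.append(child)
    return root

With hotword_score=1 and the hotwords above, the state reached via S-H-E has node score 3 and output score 5 (SHE itself plus HE through its output arc), matching the figure.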

The following are several matching examples of the graph above.

Note

For simplicity, we assume that the system emits a token each frame.

Hint

We have an extra finalize step to force the graph state to go back to the root state.

The path is “SHELF”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 3           |
2        | 1           | 2                 | 4           |
3        | 1 + 5       | 8                 | 5           | HE, SHE
4        | 1           | 9                 | 6           |
5        | -4          | 5                 | 0           |
finalize | 0           | 5                 | 0           |

At frame 3 we reach state 5 and match HE and SHE, so we get a boosting score of 1 + 5; the 1 here is because SHEL might still be the prefix of other hotwords. At frame 5, F cannot match any token and we fail back to the root, so we cancel the score for SHEL, which is 4 (the node score of state 6).

The path is “HI”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 1           |
2        | 1           | 2                 | 8           |
finalize | -2          | 0                 | 0           |

Both H and I match tokens in the graph; unfortunately, we have to go back to the root state when we finish matching a path, so we cancel the boosting score of HI, which is 2 (the node score of state 8).

The path is “THE”

Frame    | Boost score | Total boost score | Graph state | Matched hotwords
---------|-------------|-------------------|-------------|-----------------
init     | 0           | 0                 | 0           |
1        | 1           | 1                 | 10          |
2        | 1           | 2                 | 11          |
3        | 0 + 2       | 4                 | 2           | HE
finalize | -2          | 2                 | 0           |

At frame 3 we jump from state 11 to state 2 and get a boosting score of 0 + 2. The 0 is because the node score of state 2 is the same as that of state 11, so we gain nothing from the partial match (the prefix of state 11, which is TH, has the same length as the prefix of state 2, which is HE), but we do get the output score (state 2 outputs HE).
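
The tables above can be reproduced with a short traversal sketch on top of the build() function from the earlier sketch. Again, this is only an illustration of the bookkeeping; the actual sherpa-onnx code applies these score deltas inside modified_beam_search.

def advance(state, token, root):
    # Move one token forward; return (score_delta, new_state).
    s = state
    while token not in s.next and s is not root:
        s = s.fail
    new = s.next.get(token, root)
    # partial-match gain (or cancellation) + output score of full matches
    return new.node_score - state.node_score + new.output_score, new

def finalize(state, root):
    # Force the state back to root, cancelling any pending partial match.
    return root.node_score - state.node_score, root

root = build([list("HE"), list("SHE"), list("SHELL"), list("HIS"), list("THIS")])
state, total = root, 0.0
for t in "SHELF":  # one token per frame, as in the tables above
    delta, state = advance(state, t, root)
    total += delta
delta, state = finalize(state, root)
total += delta
print(total)  # 5.0, i.e. the scores of the matched hotwords HE and SHE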

Note

The hotwords feature is implemented at inference time; you don't have to re-train the models to use it.

How to use hotwords in sherpa-onnx

Caution

Currently, the hotwords feature is only supported with the modified_beam_search decoding method of the transducer models (both streaming and non-streaming).

Using hotwords is no different for streaming and non-streaming models; in fact, it is the same across all the APIs supported by sherpa-onnx. We add two extra arguments for hotwords (a sketch of passing them through the Python API follows the list below):

  • hotwords-file

    The file path of the hotwords; one hotword per line. For each hotword, the characters (for Chinese) or BPE tokens are separated by spaces.

    Caution

    The hotwords in hotwords-file should be tokenized into modeling units (i.e. the symbols in tokens.txt).

    For models trained on cjkchar it looks like:

    语 音 识 别
    深 度 学 习

    For models trained on bpe (English like languages) it looks like:

    ▁SP E E CH ▁RE C O G N ITION
    ▁DE E P ▁LE AR N ING
    

    For models trained on cjkchar+bpe it looks like:

    ▁SP E E CH 识 别
    ▁SP E E CH ▁RE CO G N ITION
    深 度 学 习
    
  • hotwords-score

    The boosting score for each matched token.

    Note

    We match the hotwords at token level, so the hotwords-score is applied at token level.
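
As referenced above, here is a minimal sketch of passing these two arguments through the Python API, assuming a recent sherpa-onnx version, a 16 kHz 16-bit mono wav, and the exp/ model layout used in the examples below (see python-api-examples/offline-decode-files.py for the full script):

import wave
import numpy as np
import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="exp/encoder-epoch-99-avg-1.onnx",
    decoder="exp/decoder-epoch-99-avg-1.onnx",
    joiner="exp/joiner-epoch-99-avg-1.onnx",
    tokens="exp/tokens.txt",
    decoding_method="modified_beam_search",  # required for hotwords
    hotwords_file="hotwords_en.txt",         # tokenized hotwords, one per line
    hotwords_score=2.0,                      # boosting score per matched token
)

with wave.open("exp/test_wavs/0.wav") as f:
    # assumes 16-bit mono samples
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768
    sample_rate = f.getframerate()

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)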

We provide a command line tool to convert hotwords to tokens, you can see the usage as follows:

sherpa-onnx-cli text2token --help
Usage: sherpa-onnx-cli text2token [OPTIONS] INPUT OUTPUT

Options:
  --tokens TEXT       The path to tokens.txt.
  --tokens-type TEXT  The type of modeling units, should be cjkchar, bpe or
                      cjkchar+bpe
  --bpe-model TEXT    The path to bpe.model.
  --help              Show this message and exit.

The tool has three options:

  • tokens

    The file path of tokens.txt; you can find tokens.txt in any of our pretrained models.

  • tokens-type

    The modeling unit used to train the models. Three kinds of tokens-type are supported now: cjkchar, bpe and cjkchar+bpe. The tokens-type tells the system how to encode hotwords into tokens.

  • bpe-model

    The file path of the BPE model used to generate tokens.txt; it is also used to encode the hotwords. Only used when tokens-type is bpe or cjkchar+bpe (see the sketch below).
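
If you would rather do the conversion in your own code, the following sketch reproduces what the tool does for tokens-type bpe, assuming the sentencepiece package is installed and the file paths are placeholders for your own files:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("exp/bpe.model")  # the bpe.model shipped with the pretrained model

with open("hotwords.txt", encoding="utf-8") as fin, \
        open("hotwords_en.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        word = line.strip()
        if word:
            # encode_as_pieces returns BPE pieces such as ['▁FOR', 'E', 'VER']
            fout.write(" ".join(sp.encode_as_pieces(word)) + "\n")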

Note

If you install sherpa-onnx from source (i.e. not via pip), you can use the alternative script in scripts; the usage is almost the same as the command line tool. Read the help information with:

python3 scripts/text2token.py --help

The main difference in using the hotwords feature lies in the modeling units (i.e. tokens-type). The following shows how to use it for different modeling units.

Hint

You can use any transducer model from https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html; we just chose three of them randomly for the following examples.

Note

In the following examples, we use a non-streaming model; if you are using a streaming model, you should use sherpa-onnx instead of sherpa-onnx-offline. sherpa-onnx-alsa, sherpa-onnx-microphone, sherpa-onnx-microphone-offline, sherpa-onnx-online-websocket-server and sherpa-onnx-offline-websocket-server all support hotwords as well.

Modeling unit is bpe

Download the model

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
tar xvf sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
rm sherpa-onnx-zipformer-en-2023-04-01.tar.bz2
ln -s sherpa-onnx-zipformer-en-2023-04-01 exp

Convert the hotwords into tokens

sherpa-onnx-cli text2token \
  --tokens exp/tokens.txt \
  --tokens-type bpe \
  --bpe-model exp/bpe.model \
  hotwords.txt hotwords_en.txt

The hotwords.txt contains:

QUARTERS
FOREVER

The hotwords_en.txt contains:

▁ QUA R TER S
▁FOR E VER

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx-offline \
  --encoder=exp/encoder-epoch-99-avg-1.onnx \
  --decoder=exp/decoder-epoch-99-avg-1.onnx \
  --joiner=exp/joiner-epoch-99-avg-1.onnx \
  --decoding-method=modified_beam_search \
  --tokens=exp/tokens.txt \
  exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp/encoder-epoch-99-avg-1.onnx --decoder=exp/decoder-epoch-99-avg-1.onnx --joiner=exp/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp/tokens.txt exp/test_wavs/0.wav exp/test_wavs/1.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=, hotwords_score=1.5)
Creating recognizer ...
Started
Done!

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.32, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]","tokens":["A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER"," OF"," THE"," B","RO","TH","EL","S"]}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[2.44, 2.64, 2.88, 3.16, 3.28, 3.48, 3.60, 3.80, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.44, 5.68, 6.04, 6.24, 6.48, 6.84, 7.08, 7.32, 7.56, 7.84, 8.12, 8.24, 8.32, 8.44, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.56, 9.64, 9.76, 9.96, 10.04, 10.20, 10.40, 10.64, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, 13.96, 14.12, 14.24, 14.36, 14.52, 14.68, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.16, 16.24, 16.32]","tokens":["IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR"," E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.775 s
Real time factor (RTF): 1.775 / 23.340 = 0.076

Decoding with hotwords

./build/bin/sherpa-onnx-offline \
    --encoder=exp/encoder-epoch-99-avg-1.onnx \
    --decoder=exp/decoder-epoch-99-avg-1.onnx \
    --joiner=exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp/tokens.txt \
    --hotwords-file=hotwords_en.txt \
    --hotwords-score=2.0 \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp/encoder-epoch-99-avg-1.onnx --decoder=exp/decoder-epoch-99-avg-1.onnx --joiner=exp/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp/tokens.txt --hotwords-file=hotwords_en.txt --hotwords-score=2.0 exp/test_wavs/0.wav exp/test_wavs/1.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp/encoder-epoch-99-avg-1.onnx", decoder_filename="exp/decoder-epoch-99-avg-1.onnx", joiner_filename="exp/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=hotwords_en.txt, hotwords_score=2)
Creating recognizer ...
Started
Done!

exp/test_wavs/0.wav
{"text":"ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS","timestamps":"[1.44, 1.48, 1.56, 1.72, 1.88, 1.96, 2.16, 2.28, 2.36, 2.48, 2.60, 2.80, 3.08, 3.28, 3.40, 3.60, 3.80, 4.08, 4.24, 4.32, 4.48, 4.64, 4.84, 4.88, 5.00, 5.08, 5.12, 5.36, 5.48, 5.60, 5.68, 5.84, 6.04, 6.24]","tokens":["A","LL"," THE"," YE","LL","OW"," LA","M","P","S"," WOULD"," LIGHT"," UP"," HE","RE"," AND"," THERE"," THE"," S","QUA","LI","D"," ","QUA","R","TER","S"," OF"," THE"," B","RO","TH","EL","S"]}
----
exp/test_wavs/1.wav
{"text":"IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN","timestamps":"[2.44, 2.64, 2.88, 3.16, 3.28, 3.48, 3.60, 3.80, 3.96, 4.12, 4.36, 4.52, 4.72, 4.92, 5.16, 5.44, 5.68, 6.04, 6.24, 6.48, 6.84, 7.08, 7.32, 7.56, 7.84, 8.12, 8.24, 8.32, 8.44, 8.60, 8.76, 8.88, 9.08, 9.28, 9.44, 9.56, 9.64, 9.76, 9.96, 10.04, 10.20, 10.40, 10.68, 10.76, 11.04, 11.20, 11.36, 11.60, 11.80, 12.00, 12.12, 12.28, 12.32, 12.52, 12.72, 12.84, 12.96, 13.04, 13.24, 13.40, 13.60, 13.76, 13.96, 14.12, 14.24, 14.36, 14.52, 14.68, 14.76, 15.04, 15.28, 15.52, 15.76, 16.00, 16.16, 16.24, 16.32]","tokens":["IN"," WHICH"," MAN"," TH","US"," P","UN","ISH","ED"," HAD"," GIVE","N"," HER"," A"," LOVE","LY"," CHILD"," WHO","SE"," PLACE"," WAS"," ON"," THAT"," SAME"," DIS","HO","N","OUR","ED"," BO","S","OM"," TO"," CON","NE","C","T"," HER"," P","AR","ENT"," FOR","E","VER"," WITH"," THE"," RA","CE"," AND"," DE","S","C","ENT"," OF"," MO","R","T","AL","S"," AND"," TO"," BE"," FI","N","AL","LY"," A"," B","LESS","ED"," SO","UL"," IN"," HE","A","VE","N"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.522 s
Real time factor (RTF): 1.522 / 23.340 = 0.065

Hint

QUARTER -> QUARTERS

FOR EVER -> FOREVER

Python API

Decoding without hotwords

python python-api-examples/offline-decode-files.py \
    --encoder=exp/encoder-epoch-99-avg-1.onnx \
    --decoder=exp/decoder-epoch-99-avg-1.onnx \
    --joiner=exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp/tokens.txt \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.546 s
Real time factor (RTF): 2.546/23.340 = 0.109

Decoding with hotwords

python python-api-examples/offline-decode-files.py \
    --encoder=exp/encoder-epoch-99-avg-1.onnx \
    --decoder=exp/decoder-epoch-99-avg-1.onnx \
    --joiner=exp/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp/tokens.txt \
    --hotwords-file=hotwords_en.txt \
    --hotwords-score=2.0 \
    exp/test_wavs/0.wav exp/test_wavs/1.wav

The output is:

Started!
Done!
exp/test_wavs/0.wav
ALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTERS OF THE BROTHELS
----------
exp/test_wavs/1.wav
IN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 23.340 s
Elapsed time: 2.463 s
Real time factor (RTF): 2.463/23.340 = 0.106

Hint

QUARTER -> QUARTERS

FOR EVER -> FOREVER

Modeling unit is cjkchar

Download the model

cd /path/to/sherpa-onnx
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-onnx-conformer-zh-stateless2-2023-05-23
cd sherpa-onnx-conformer-zh-stateless2-2023-05-23
git lfs pull --include "*.onnx"
cd ..
ln -s sherpa-onnx-conformer-zh-stateless2-2023-05-23 exp-zh

Convert the hotwords into tokens

sherpa-onnx-cli text2token \
  --tokens exp-zh/tokens.txt \
  --tokens-type cjkchar \
  hotwords.txt hotwords_cn.txt

The hotwords.txt contains:

文森特卡索
周望君
朱丽楠
蒋有伯

The hotwords_cn.txt contains:

文 森 特 卡 索
周 望 君
朱 丽 楠
蒋 有 伯
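
As a rough sketch of what the cjkchar conversion does: each hotword is simply split into individual characters separated by spaces. The file names below match the example above.

with open("hotwords.txt", encoding="utf-8") as fin, \
        open("hotwords_cn.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        word = line.strip()
        if word:
            # "文森特卡索" -> "文 森 特 卡 索"
            fout.write(" ".join(word) + "\n")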

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx-offline \
    --encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens=exp-zh/tokens.txt \
    --decoding-method=modified_beam_search \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --encoder=exp-zh/encoder-epoch-99-avg-1.onnx --decoder=exp-zh/decoder-epoch-99-avg-1.onnx --joiner=exp-zh/joiner-epoch-99-avg-1.onnx --tokens=exp-zh/tokens.txt --decoding-method=modified_beam_search exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp-zh/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=, hotwords_score=1.5)
Creating recognizer ...
Started
Done!

exp-zh/test_wavs/3.wav
{"text":"文森特卡所是全球知名的法国性格派演员","timestamps":"[0.00, 0.16, 0.68, 1.32, 1.72, 2.08, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":["文","森","特","卡","所","是","全","球","知","名","的","法","国","性","格","派","演","员"]}
----
exp-zh/test_wavs/4.wav
{"text":"蒋友伯被拍到带着女儿出游","timestamps":"[0.00, 0.20, 0.88, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":["蒋","友","伯","被","拍","到","带","着","女","儿","出","游"]}
----
exp-zh/test_wavs/5.wav
{"text":"周望军就落实控物价","timestamps":"[0.00, 0.16, 0.88, 1.24, 1.64, 1.96, 2.76, 3.04, 3.32]","tokens":["周","望","军","就","落","实","控","物","价"]}
----
exp-zh/test_wavs/6.wav
{"text":"朱立南在上市见面会上表示","timestamps":"[0.00, 0.16, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":["朱","立","南","在","上","市","见","面","会","上","表","示"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.883 s
Real time factor (RTF): 1.883 / 20.328 = 0.093

Decoding with hotwords

./build/bin/sherpa-onnx-offline \
    --encoder=exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens=exp-zh/tokens.txt \
    --decoding-method=modified_beam_search \
    --hotwords-file=hotwords_cn.txt \
    --hotwords-score=2.0 \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

OfflineRecognizerConfig(feat_config=OfflineFeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="exp-zh/encoder-epoch-99-avg-1.onnx", decoder_filename="exp-zh/decoder-epoch-99-avg-1.onnx", joiner_filename="exp-zh/joiner-epoch-99-avg-1.onnx"), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe"), tdnn=OfflineTdnnModelConfig(model=""), tokens="exp-zh/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type=""), lm_config=OfflineLMConfig(model="", scale=0.5), decoding_method="modified_beam_search", max_active_paths=4, hotwords_file=hotwords_cn.txt, hotwords_score=2)
Creating recognizer ...
Started
Done!

exp-zh/test_wavs/3.wav
{"text":"文森特卡索是全球知名的法国性格派演员","timestamps":"[0.00, 0.16, 0.64, 1.28, 1.64, 2.04, 2.60, 2.88, 3.20, 3.52, 3.92, 4.40, 4.68, 5.12, 5.44, 6.36, 6.96, 7.32]","tokens":["文","森","特","卡","索","是","全","球","知","名","的","法","国","性","格","派","演","员"]}
----
exp-zh/test_wavs/4.wav
{"text":"蒋有伯被拍到带着女儿出游","timestamps":"[0.00, 0.12, 0.80, 1.36, 1.76, 2.08, 2.28, 2.68, 2.92, 3.16, 3.44, 3.80]","tokens":["蒋","有","伯","被","拍","到","带","着","女","儿","出","游"]}
----
exp-zh/test_wavs/5.wav
{"text":"周望君就落实空物价","timestamps":"[0.00, 0.12, 0.80, 1.24, 1.56, 1.96, 2.68, 3.08, 3.32]","tokens":["周","望","君","就","落","实","空","物","价"]}
----
exp-zh/test_wavs/6.wav
{"text":"朱丽楠在上市见面会上表示","timestamps":"[0.00, 0.12, 0.80, 1.12, 1.44, 1.68, 1.92, 2.16, 2.36, 2.60, 2.84, 3.12]","tokens":["朱","丽","楠","在","上","市","见","面","会","上","表","示"]}
----
num threads: 2
decoding method: modified_beam_search
max active paths: 4
Elapsed seconds: 1.810 s
Real time factor (RTF): 1.810 / 20.328 = 0.089

Hint

文森特卡所 -> 文森特卡索

周望军 -> 周望君

朱立南 -> 朱丽楠

蒋友伯 -> 蒋有伯

Python API

Decoding without hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens exp-zh/tokens.txt \
    --decoding-method modified_beam_search \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

Started!
Done!
exp-zh/test_wavs/3.wav
文森特卡所是全球知名的法国性格派演员
----------
exp-zh/test_wavs/4.wav
蒋友伯被拍到带着女儿出游
----------
exp-zh/test_wavs/5.wav
周望军就落实控物价
----------
exp-zh/test_wavs/6.wav
朱立南在上市见面会上表示
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.653 s
Real time factor (RTF): 2.653/20.328 = 0.131

Decoding with hotwords

python python-api-examples/offline-decode-files.py \
    --encoder exp-zh/encoder-epoch-99-avg-1.onnx \
    --decoder exp-zh/decoder-epoch-99-avg-1.onnx \
    --joiner exp-zh/joiner-epoch-99-avg-1.onnx \
    --tokens exp-zh/tokens.txt \
    --decoding-method modified_beam_search \
    --hotwords-file hotwords_cn.txt \
    --hotwords-score 2.0 \
    exp-zh/test_wavs/3.wav exp-zh/test_wavs/4.wav exp-zh/test_wavs/5.wav exp-zh/test_wavs/6.wav

The output is:

Started!
Done!
exp-zh/test_wavs/3.wav
文森特卡索是全球知名的法国性格派演员
----------
exp-zh/test_wavs/4.wav
蒋有伯被拍到带着女儿出游
----------
exp-zh/test_wavs/5.wav
周望君就落实空物价
----------
exp-zh/test_wavs/6.wav
朱丽楠在上市见面会上表示
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 20.328 s
Elapsed time: 2.636 s
Real time factor (RTF): 2.636/20.328 = 0.130

Hint

文森特卡所 -> 文森特卡索

周望军 -> 周望君

朱立南 -> 朱丽楠

蒋友伯 -> 蒋有伯

Modeling unit is cjkchar+bpe

Download the model

cd /path/to/sherpa-onnx

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20
cd sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20
git lfs pull --include "*.onnx"
git lfs pull --include "bpe.model"
cd ..
ln -s sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20 exp-mixed

Convert the hotwords into tokens

sherpa-onnx-cli text2token \
  --tokens exp-mixed/tokens.txt \
  --tokens-type cjkchar+bpe \
  --bpe-model exp-mixed/bpe.model \
  hotwords.txt hotwords_mix.txt

The hotwords.txt contains:

礼拜二
频繁

The hotwords_mix.txt contains:

礼 拜 二
频 繁
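
As a rough sketch of what the cjkchar+bpe conversion does (a hypothetical re-implementation, not the tool's actual code): CJK characters become one token each, and the remaining text goes through the BPE model.

import re
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("exp-mixed/bpe.model")

def text2token_mixed(text):
    # Split the text into CJK runs and non-CJK runs.
    pieces = []
    for chunk in re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text):
        if re.match(r"[\u4e00-\u9fff]", chunk):
            pieces.extend(chunk)  # one token per CJK character
        else:
            pieces.extend(sp.encode_as_pieces(chunk.strip()))
    return " ".join(pieces)

print(text2token_mixed("礼拜二"))  # -> 礼 拜 二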

C++ API

Decoding without hotwords

./build/bin/sherpa-onnx \
    --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp-mixed/tokens.txt \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp-mixed/tokens.txt exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav
OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), tokens="exp-mixed/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type=""), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=1.5, hotwords_file="", decoding_method="modified_beam_search")

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3, Real time factor (RTF): 0.3
昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三
{"is_final":false,"segment":0,"start_time":0.0,"text":"昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.80, 6.16, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":["昨","天","是"," MO","N","DAY"," TO","DAY"," IS"," LI","B","R"," THE"," DAY"," AFTER"," TO","M","OR","ROW","是","星","期","三"]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.7, Real time factor (RTF): 0.37
是不是平凡的啊不认识记下来 FREQUENTLY频繁的
{"is_final":false,"segment":0,"start_time":0.0,"text":"是不是平凡的啊不认识记下来 FREQUENTLY频繁的","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3.00, 3.12, 3.32, 3.64, 3.96, 4.36]","tokens":["是","不","是","平","凡","的","啊","不","认","识","记","下","来"," F","RE","QU","ENT","LY","频","繁","的"]}

Decoding with hotwords

./build/bin/sherpa-onnx \
    --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method=modified_beam_search \
    --tokens=exp-mixed/tokens.txt \
    --hotwords-file=hotwords_mix.txt \
    --hotwords-score=2.0 \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

/star-kw/kangwei/code/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --encoder=exp-mixed/encoder-epoch-99-avg-1.onnx --decoder=exp-mixed/decoder-epoch-99-avg-1.onnx --joiner=exp-mixed/joiner-epoch-99-avg-1.onnx --decoding-method=modified_beam_search --tokens=exp-mixed/tokens.txt --tokens-type=cjkchar+bpe --bpe-model=exp-mixed/bpe.model --hotwords-file=hotwords_mix.txt --hotwords-score=2.0 exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="exp-mixed/encoder-epoch-99-avg-1.onnx", decoder="exp-mixed/decoder-epoch-99-avg-1.onnx", joiner="exp-mixed/joiner-epoch-99-avg-1.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), tokens="exp-mixed/tokens.txt", num_threads=1, debug=False, provider="cpu", model_type=""), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=True, max_active_paths=4, hotwords_score=2, hotwords_file="hotwords_mix.txt", decoding_method="modified_beam_search")

exp-mixed/test_wavs/0.wav
Elapsed seconds: 3.2, Real time factor (RTF): 0.32
昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三
{"is_final":false,"segment":0,"start_time":0.0,"text":"昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三","timestamps":"[0.64, 1.04, 1.60, 2.08, 2.20, 2.40, 4.16, 4.40, 4.88, 5.56, 5.68, 6.00, 6.84, 7.12, 7.44, 8.04, 8.16, 8.24, 8.28, 9.04, 9.40, 9.64, 9.88]","tokens":["昨","天","是"," MO","N","DAY"," TO","DAY"," IS","礼","拜","二"," THE"," DAY"," AFTER"," TO","M","OR","ROW","是","星","期","三"]}

exp-mixed/test_wavs/2.wav
Elapsed seconds: 1.9, Real time factor (RTF): 0.4
是不是频繁的啊不认识记下来 FREQUENTLY频繁的
{"is_final":false,"segment":0,"start_time":0.0,"text":"是不是频繁的啊不认识记下来 FREQUENTLY频繁的","timestamps":"[0.00, 0.40, 0.52, 0.96, 1.08, 1.28, 1.48, 1.68, 1.84, 2.00, 2.24, 2.36, 2.52, 2.68, 2.92, 3.00, 3.12, 3.32, 3.64, 3.96, 4.36]","tokens":["是","不","是","频","繁","的","啊","不","认","识","记","下","来"," F","RE","QU","ENT","LY","频","繁","的"]}

Hint

LIBR -> 礼拜二

平凡 -> 频繁

Python API

Decoding without hotwords

python python-api-examples/online-decode-files.py \
    --encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp-mixed/tokens.txt \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

Started!
Done!
exp-mixed/test_wavs/0.wav
昨天是 MONDAY TODAY IS LIBR THE DAY AFTER TOMORROW是星期三
----------
exp-mixed/test_wavs/2.wav
是不是平凡的啊不认识记下来 FREQUENTLY频繁的
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.052 s
Real time factor (RTF): 3.052/14.743 = 0.207

Decoding with hotwords

python python-api-examples/online-decode-files.py \
    --encoder exp-mixed/encoder-epoch-99-avg-1.onnx \
    --decoder exp-mixed/decoder-epoch-99-avg-1.onnx \
    --joiner exp-mixed/joiner-epoch-99-avg-1.onnx \
    --decoding-method modified_beam_search \
    --tokens exp-mixed/tokens.txt \
    --hotwords-file hotwords_mix.txt \
    --hotwords-score 2.0 \
    exp-mixed/test_wavs/0.wav exp-mixed/test_wavs/2.wav

The output is:

Started!
Done!
exp-mixed/test_wavs/0.wav
昨天是 MONDAY TODAY IS礼拜二 THE DAY AFTER TOMORROW是星期三
----------
exp-mixed/test_wavs/2.wav
是不是频繁的啊不认识记下来 FREQUENTLY频繁的
----------
num_threads: 1
decoding_method: modified_beam_search
Wave duration: 14.743 s
Elapsed time: 3.060 s
Real time factor (RTF): 3.060/14.743 = 0.208

Hint

LIBR -> 礼拜二

平凡 -> 频繁