Endpointing
We have three rules for endpoint detection. If any of them is activated, we assume an endpoint is detected.
Note
We borrow the implementation from
https://kaldi-asr.org/doc/structkaldi_1_1OnlineEndpointRule.html
Rule 1
In Rule 1, we count the duration of trailing silence. If it is larger than a user-specified value, Rule 1 is activated. The following example uses 2.4 seconds as the threshold.
Two cases are given:
In the first case, nothing has been decoded when the duration of trailing silence reaches 2.4 seconds.
In the second case, we first decode something before the duration of trailing silence reaches 2.4 seconds.
In both cases, Rule 1 is activated.
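The Rule 1 check can be sketched in a few lines of Python. This is only an illustration of the description above, not the sherpa-ncnn source: the function name is hypothetical, and treating "reaches the threshold" as `>=` is an assumption.

```python
def rule1_activated(
    decoded_something: bool,
    trailing_silence: float,
    min_trailing_silence: float = 2.4,
) -> bool:
    """Hypothetical helper: True once trailing silence (seconds) reaches the threshold.

    `decoded_something` is accepted only to show that Rule 1 ignores it.
    """
    return trailing_silence >= min_trailing_silence


# Case 1: nothing has been decoded, and silence has lasted 2.4 seconds.
case1 = rule1_activated(decoded_something=False, trailing_silence=2.4)

# Case 2: something was decoded, then silence lasted 2.4 seconds.
case2 = rule1_activated(decoded_something=True, trailing_silence=2.4)

# Counter-example: only 1.0 second of trailing silence so far.
too_short = rule1_activated(decoded_something=True, trailing_silence=1.0)
```

Both `case1` and `case2` evaluate to `True`, matching the two cases described above, while `too_short` is `False`.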
Hint
In the Python API, you can specify rule1_min_trailing_silence while constructing an instance of sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule1.min_trailing_silence when creating EndpointConfig.
Rule 2
In Rule 2, we require that something has been decoded before we start counting the trailing silence. In the following example, after decoding something, Rule 2 is activated when the duration of trailing silence is larger than the user-specified value of 1.2 seconds.
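The difference from Rule 1 is the extra precondition that some text has already been decoded. A minimal Python sketch of that condition follows; the function name is hypothetical and the use of `>=` for the comparison is an assumption.

```python
def rule2_activated(
    decoded_something: bool,
    trailing_silence: float,
    min_trailing_silence: float = 1.2,
) -> bool:
    """Hypothetical helper: Rule 2 needs decoded text AND enough trailing silence."""
    return decoded_something and trailing_silence >= min_trailing_silence


# Silence alone is not enough: nothing has been decoded yet.
only_silence = rule2_activated(decoded_something=False, trailing_silence=5.0)

# Text was decoded, but the pause is still shorter than 1.2 seconds.
short_pause = rule2_activated(decoded_something=True, trailing_silence=1.0)

# Text was decoded and the pause has reached the threshold.
long_pause = rule2_activated(decoded_something=True, trailing_silence=1.5)
```

Only `long_pause` is `True`. Because Rule 2 fires only after speech, its threshold can be shorter than the one used for Rule 1.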
Hint
In the Python API, you can specify rule2_min_trailing_silence while constructing an instance of sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule2.min_trailing_silence when creating EndpointConfig.
Rule 3
Rule 3 is activated when the utterance length in seconds is larger than a given value. In the following example, Rule 3 is activated after the first segment reaches a given value, which is 20 seconds in this case.
Hint
In the Python API, you can specify rule3_min_utterance_length while constructing an instance of sherpa_ncnn.Recognizer.
In the C++ API, you can specify rule3.min_utterance_length when creating EndpointConfig.
Note
If you want to deactivate this rule, please provide a very large value for rule3_min_utterance_length or rule3.min_utterance_length.
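To see how the three rules combine, and why a very large value switches Rule 3 off, here is a self-contained Python sketch. It is an assumption-laden illustration rather than the library's code: the field names are taken from Kaldi's OnlineEndpointRule, the classes and functions are hypothetical, the thresholds are the example values used on this page, and the comparisons use `>=` by assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class EndpointRule:
    # Field names follow Kaldi's OnlineEndpointRule; this class is illustrative.
    must_contain_nonsilence: bool
    min_trailing_silence: float  # seconds
    min_utterance_length: float  # seconds


def rule_activated(rule: EndpointRule, decoded_something: bool,
                   trailing_silence: float, utterance_length: float) -> bool:
    """A rule fires only if all of its conditions hold."""
    return (
        (decoded_something or not rule.must_contain_nonsilence)
        and trailing_silence >= rule.min_trailing_silence
        and utterance_length >= rule.min_utterance_length
    )


def is_endpoint(rules: List[EndpointRule], decoded_something: bool,
                trailing_silence: float, utterance_length: float) -> bool:
    """An endpoint is detected as soon as any rule fires."""
    return any(
        rule_activated(r, decoded_something, trailing_silence, utterance_length)
        for r in rules
    )


# Example values from this page: 2.4 s, 1.2 s, and 20 s.
rules = [
    EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0.0),
    EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0.0),
    EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0.0, min_utterance_length=20.0),
]

# 25 seconds of continuous speech: only Rule 3 can fire.
long_speech = is_endpoint(rules, decoded_something=True,
                          trailing_silence=0.0, utterance_length=25.0)

# Same input with Rule 3 deactivated by a very large minimum utterance length.
rules_without_rule3 = rules[:2] + [replace(rules[2], min_utterance_length=1e10)]
long_speech_no_rule3 = is_endpoint(rules_without_rule3, decoded_something=True,
                                   trailing_silence=0.0, utterance_length=25.0)
```

With the example thresholds, `long_speech` is `True` and `long_speech_no_rule3` is `False`: once Rule 3 can no longer fire, a long utterance ends only when Rule 1 or Rule 2 sees enough trailing silence.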
Demo
Multilingual (Chinese + English)
The following video demonstrates using the Python API of sherpa-ncnn for real-time speech recognition with endpointing.
FAQs
How to compute duration of silence
For each frame to be decoded, we can output either a blank or a non-blank token.
We record the number of contiguous blanks that have been decoded so far.
In the current default setting, each frame is 10 ms. Thus, we can get the duration of trailing silence by counting the number of contiguous trailing blanks.
Note
If a model uses a subsampling factor of 4, the time resolution becomes 10 * 4 = 40 ms.
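The counting described above can be sketched in Python as follows. The function name is hypothetical, and the blank token ID of 0 is an assumption made for the example; a streaming decoder would typically keep a running counter instead of rescanning a list.

```python
from typing import Sequence


def trailing_silence_seconds(
    token_ids: Sequence[int],
    frame_shift_ms: float = 10.0,
    subsampling_factor: int = 1,
    blank_id: int = 0,  # assumption: the blank token has ID 0
) -> float:
    """Convert the number of contiguous trailing blanks into seconds."""
    num_trailing_blanks = 0
    # Walk backwards until the most recent non-blank token is found.
    for token in reversed(token_ids):
        if token != blank_id:
            break
        num_trailing_blanks += 1
    # Each decoded frame covers frame_shift_ms * subsampling_factor milliseconds.
    return num_trailing_blanks * frame_shift_ms * subsampling_factor / 1000.0


# One token per decoded frame: a blank, two non-blank tokens, then three blanks.
decoded = [0, 5, 7, 0, 0, 0]

# Subsampling factor 4: 3 blanks * 40 ms = 0.12 seconds.
silence_subsampled = trailing_silence_seconds(decoded, subsampling_factor=4)

# No subsampling: 3 blanks * 10 ms = 0.03 seconds.
silence_plain = trailing_silence_seconds(decoded)
```

The leading blank in `decoded` is not counted, since only the blanks after the last non-blank token contribute to trailing silence.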