Pre-trained models
This page lists pre-trained models for speaker segmentation.
Models for speaker embedding extraction can be found at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models.
Colab notebook
We provide a Colab notebook so you can try this section step by step.
sherpa-onnx-pyannote-segmentation-3-0
This model is converted from https://huggingface.co/pyannote/segmentation-3.0. You can find the conversion script at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
Download the model
Please use the following commands to download the model:
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
ls -lh sherpa-onnx-pyannote-segmentation-3-0/{*.onnx,LICENSE,README.md}
You should see the following output:
-rw-r--r-- 1 fangjun staff 1.0K Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/LICENSE
-rw-r--r-- 1 fangjun staff 115B Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/README.md
-rw-r--r-- 1 fangjun staff 1.5M Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx
-rw-r--r-- 1 fangjun staff 5.7M Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/model.onnx
Usage for speaker diarization
First, let’s download a test wave file. The model expects 16 kHz, 16-bit, single-channel wave files.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
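The provided test file already satisfies these requirements. If you want to try your own recordings, you can convert them first with a tool such as ffmpeg (ffmpeg is an assumption here and is not part of sherpa-onnx); a minimal sketch:
# Hypothetical example: convert my-audio.mp3 (any sample rate / channel count)
# into a 16 kHz, 16-bit, mono wave file that the model can consume.
ffmpeg -i my-audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le my-audio-16k.wav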
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two of them in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/nemo_en_titanet_small.onnx
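Optionally, you can check that both embedding models are in place before running (a quick sanity check, not required):
ls -lh 3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx nemo_en_titanet_small.onnx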
Now let’s run it.
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
# Note: Since we know there are 4 speakers in the ./0-four-speakers-zh.wav file, we
# provide the argument --clustering.num-clusters=4.
# If you don't have such information, please use the argument --clustering.cluster-threshold.
# A larger threshold results in fewer speakers.
# A smaller threshold results in more speakers.
#
# Hint: You can use --clustering.cluster-threshold=0.9 for this specific wave file.
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.318 -- 6.865 speaker_00
7.017 -- 10.747 speaker_01
11.455 -- 13.632 speaker_01
13.750 -- 17.041 speaker_02
22.137 -- 24.837 speaker_00
27.638 -- 29.478 speaker_03
30.001 -- 31.553 speaker_03
33.680 -- 37.932 speaker_03
48.040 -- 50.470 speaker_02
52.529 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 16.870 s
Real time factor (RTF): 16.870 / 56.861 = 0.297
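As noted in the hint above, if you do not know the number of speakers in advance, you can replace --clustering.num-clusters with --clustering.cluster-threshold. A sketch using the value suggested for this particular file (tune it for your own data; output not shown here):
cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-speaker-diarization \
  --clustering.cluster-threshold=0.9 \
  --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
  --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
  ./0-four-speakers-zh.wav

# Increase the threshold if too many speakers are reported;
# decrease it if distinct speakers get merged.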
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.638 -- 6.848 speaker_00
7.017 -- 10.696 speaker_01
11.472 -- 13.548 speaker_01
13.784 -- 16.990 speaker_02
22.154 -- 24.837 speaker_00
27.655 -- 29.461 speaker_03
30.018 -- 31.503 speaker_03
33.680 -- 37.915 speaker_03
48.040 -- 50.487 speaker_02
52.546 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 13.679 s
Real time factor (RTF): 13.679 / 56.861 = 0.241
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.318 -- 6.865 speaker_00
7.017 -- 10.747 speaker_01
11.455 -- 13.632 speaker_01
13.750 -- 17.041 speaker_02
22.137 -- 24.837 speaker_00
27.638 -- 29.478 speaker_03
30.001 -- 31.553 speaker_03
33.680 -- 37.932 speaker_03
48.040 -- 50.470 speaker_02
52.529 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 6.756 s
Real time factor (RTF): 6.756 / 56.861 = 0.119
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.638 -- 6.848 speaker_00
7.017 -- 10.696 speaker_01
11.472 -- 13.548 speaker_01
13.784 -- 16.990 speaker_02
22.154 -- 24.837 speaker_00
27.655 -- 29.461 speaker_03
30.018 -- 31.503 speaker_03
33.680 -- 37.915 speaker_03
48.040 -- 50.487 speaker_02
52.546 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 6.231 s
Real time factor (RTF): 6.231 / 56.861 = 0.110
sherpa-onnx-reverb-diarization-v1
This model is converted from https://huggingface.co/Revai/reverb-diarization-v1. You can find the conversion script at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
Caution
It is released under a non-commercial license.
You can find its license at https://huggingface.co/Revai/reverb-diarization-v1/blob/main/LICENSE.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
Download the model
Please use the following commands to download the model:
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-reverb-diarization-v1.tar.bz2
tar xvf sherpa-onnx-reverb-diarization-v1.tar.bz2
rm sherpa-onnx-reverb-diarization-v1.tar.bz2
ls -lh sherpa-onnx-reverb-diarization-v1/{*.onnx,LICENSE,README.md}
You should see the following output:
-rw-r--r-- 1 fangjun staff 11K Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/LICENSE
-rw-r--r-- 1 fangjun staff 320B Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/README.md
-rw-r--r-- 1 fangjun staff 2.3M Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/model.int8.onnx
-rw-r--r-- 1 fangjun staff 9.1M Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/model.onnx
Usage for speaker diarization
First, let’s download a test wave file. The model expects 16 kHz, 16-bit, single-channel wave files.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two of them in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/nemo_en_titanet_small.onnx
Now let’s run it.
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
# Note: Since we know there are 4 speakers in the ./0-four-speakers-zh.wav file, we
# provide the argument --clustering.num-clusters=4.
# If you don't have such information, please use the argument --clustering.cluster-threshold.
# A larger threshold results in fewer speakers.
# A smaller threshold results in more speakers.
#
# Hint: You can use --clustering.cluster-threshold=0.9 for this specific wave file.
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.798 speaker_01
7.017 -- 13.649 speaker_00
13.801 -- 16.957 speaker_02
21.023 -- 24.820 speaker_01
27.638 -- 38.017 speaker_03
44.345 -- 45.526 speaker_00
45.526 -- 50.268 speaker_02
52.563 -- 54.605 speaker_01
Duration : 56.861 s
Elapsed seconds: 25.715 s
Real time factor (RTF): 25.715 / 56.861 = 0.452
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.815 speaker_01
7.017 -- 13.666 speaker_00
13.784 -- 16.973 speaker_02
21.023 -- 24.854 speaker_01
27.655 -- 38.084 speaker_03
38.084 -- 46.943 speaker_00
45.526 -- 50.352 speaker_02
52.580 -- 54.622 speaker_01
Duration : 56.861 s
Elapsed seconds: 22.323 s
Real time factor (RTF): 22.323 / 56.861 = 0.393
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.798 speaker_01
7.017 -- 13.649 speaker_00
13.801 -- 16.957 speaker_02
21.023 -- 24.820 speaker_01
27.638 -- 38.017 speaker_02
44.345 -- 45.357 speaker_03
45.290 -- 50.268 speaker_02
52.563 -- 54.605 speaker_01
Duration : 56.861 s
Elapsed seconds: 11.465 s
Real time factor (RTF): 11.465 / 56.861 = 0.202
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.815 speaker_01
7.017 -- 13.666 speaker_00
13.784 -- 16.973 speaker_02
21.023 -- 24.854 speaker_01
27.655 -- 38.877 speaker_02
38.168 -- 45.914 speaker_03
45.526 -- 50.352 speaker_02
52.580 -- 54.622 speaker_01
Duration : 56.861 s
Elapsed seconds: 9.688 s
Real time factor (RTF): 9.688 / 56.861 = 0.170