Pre-trained models
This page lists pre-trained models for speaker segmentation.
Models for speaker embedding extraction can be found at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models.
Colab notebook
We provide a Colab notebook so you can try this section step by step.
sherpa-onnx-pyannote-segmentation-3-0
This model is converted from https://huggingface.co/pyannote/segmentation-3.0. You can find the conversion script at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
Download the model
Please use the following commands to download the model:
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
ls -lh sherpa-onnx-pyannote-segmentation-3-0/{*.onnx,LICENSE,README.md}
You should see the following output:
-rw-r--r-- 1 fangjun staff 1.0K Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/LICENSE
-rw-r--r-- 1 fangjun staff 115B Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/README.md
-rw-r--r-- 1 fangjun staff 1.5M Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx
-rw-r--r-- 1 fangjun staff 5.7M Oct 8 20:54 sherpa-onnx-pyannote-segmentation-3-0/model.onnx
Usage for speaker diarization
First, let’s download a test wave file. The model expects 16 kHz, 16-bit, single-channel wave files.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
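The provided test file already satisfies these requirements. If you want to try your own recordings, you can convert them first with a tool such as ffmpeg (ffmpeg is an assumption here and is not part of sherpa-onnx); a minimal sketch:
# Hypothetical example: convert my-audio.mp3 (any sample rate / channel count)
# into a 16 kHz, 16-bit, mono wave file that the model can consume.
ffmpeg -i my-audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le my-audio-16k.wav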
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two of them in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/nemo_en_titanet_small.onnx
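Optionally, you can check that both embedding models are in place before running (a quick sanity check, not required):
ls -lh 3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx nemo_en_titanet_small.onnx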
Now let’s run it.
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
# Note: Since we know there are 4 speakers in the ./0-four-speakers-zh.wav file, we
# provide the argument --clustering.num-clusters=4.
# If you don't have such information, please use the argument --clustering.cluster-threshold.
# A larger threshold results in fewer speakers.
# A smaller threshold results in more speakers.
#
# Hint: You can use --clustering.cluster-threshold=0.9 for this specific wave file.
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.318 -- 6.865 speaker_00
7.017 -- 10.747 speaker_01
11.455 -- 13.632 speaker_01
13.750 -- 17.041 speaker_02
22.137 -- 24.837 speaker_00
27.638 -- 29.478 speaker_03
30.001 -- 31.553 speaker_03
33.680 -- 37.932 speaker_03
48.040 -- 50.470 speaker_02
52.529 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 16.870 s
Real time factor (RTF): 16.870 / 56.861 = 0.297
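As noted in the hint above, if you do not know the number of speakers in advance, you can replace --clustering.num-clusters with --clustering.cluster-threshold. A sketch using the value suggested for this particular file (tune it for your own data; output not shown here):
cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-speaker-diarization \
  --clustering.cluster-threshold=0.9 \
  --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
  --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
  ./0-four-speakers-zh.wav

# Increase the threshold if too many speakers are reported;
# decrease it if distinct speakers get merged.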
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.638 -- 6.848 speaker_00
7.017 -- 10.696 speaker_01
11.472 -- 13.548 speaker_01
13.784 -- 16.990 speaker_02
22.154 -- 24.837 speaker_00
27.655 -- 29.461 speaker_03
30.018 -- 31.503 speaker_03
33.680 -- 37.915 speaker_03
48.040 -- 50.487 speaker_02
52.546 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 13.679 s
Real time factor (RTF): 13.679 / 56.861 = 0.241
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.318 -- 6.865 speaker_00
7.017 -- 10.747 speaker_01
11.455 -- 13.632 speaker_01
13.750 -- 17.041 speaker_02
22.137 -- 24.837 speaker_00
27.638 -- 29.478 speaker_03
30.001 -- 31.553 speaker_03
33.680 -- 37.932 speaker_03
48.040 -- 50.470 speaker_02
52.529 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 6.756 s
Real time factor (RTF): 6.756 / 56.861 = 0.119
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-pyannote-segmentation-3-0/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.638 -- 6.848 speaker_00
7.017 -- 10.696 speaker_01
11.472 -- 13.548 speaker_01
13.784 -- 16.990 speaker_02
22.154 -- 24.837 speaker_00
27.655 -- 29.461 speaker_03
30.018 -- 31.503 speaker_03
33.680 -- 37.915 speaker_03
48.040 -- 50.487 speaker_02
52.546 -- 54.605 speaker_00
Duration : 56.861 s
Elapsed seconds: 6.231 s
Real time factor (RTF): 6.231 / 56.861 = 0.110
sherpa-onnx-reverb-diarization-v1
This model is converted from https://huggingface.co/Revai/reverb-diarization-v1. You can find the conversion script at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/pyannote/segmentation.
Caution
It is released under a non-commercial license.
You can find its license at https://huggingface.co/Revai/reverb-diarization-v1/blob/main/LICENSE.
In the following, we describe how to use it together with a speaker embedding extraction model for speaker diarization.
Download the model
Please use the following commands to download the model:
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-reverb-diarization-v1.tar.bz2
tar xvf sherpa-onnx-reverb-diarization-v1.tar.bz2
rm sherpa-onnx-reverb-diarization-v1.tar.bz2
ls -lh sherpa-onnx-reverb-diarization-v1/{*.onnx,LICENSE,README.md}
You should see the following output:
-rw-r--r-- 1 fangjun staff 11K Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/LICENSE
-rw-r--r-- 1 fangjun staff 320B Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/README.md
-rw-r--r-- 1 fangjun staff 2.3M Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/model.int8.onnx
-rw-r--r-- 1 fangjun staff 9.1M Oct 17 10:49 sherpa-onnx-reverb-diarization-v1/model.onnx
Usage for speaker diarization
First, let’s download a test wave file. The model expects 16 kHz, 16-bit, single-channel wave files.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
Next, let’s download a model for extracting speaker embeddings. You can find many models at https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models. We download two of them in this example:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/nemo_en_titanet_small.onnx
Now let’s run it.
3D-Speaker + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
# Note: Since we know there are 4 speakers in the ./0-four-speakers-zh.wav file, we
# provide the argument --clustering.num-clusters=4.
# If you don't have such information, please use the argument --clustering.cluster-threshold.
# A larger threshold results in fewer speakers.
# A smaller threshold results in more speakers.
#
# Hint: You can use --clustering.cluster-threshold=0.9 for this specific wave file.
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.798 speaker_01
7.017 -- 13.649 speaker_00
13.801 -- 16.957 speaker_02
21.023 -- 24.820 speaker_01
27.638 -- 38.017 speaker_03
44.345 -- 45.526 speaker_00
45.526 -- 50.268 speaker_02
52.563 -- 54.605 speaker_01
Duration : 56.861 s
Elapsed seconds: 25.715 s
Real time factor (RTF): 25.715 / 56.861 = 0.452
3D-Speaker + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --embedding.model=./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.815 speaker_01
7.017 -- 13.666 speaker_00
13.784 -- 16.973 speaker_02
21.023 -- 24.854 speaker_01
27.655 -- 38.084 speaker_03
38.084 -- 46.943 speaker_00
45.526 -- 50.352 speaker_02
52.580 -- 54.622 speaker_01
Duration : 56.861 s
Elapsed seconds: 22.323 s
Real time factor (RTF): 22.323 / 56.861 = 0.393
NeMo + model.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.798 speaker_01
7.017 -- 13.649 speaker_00
13.801 -- 16.957 speaker_02
21.023 -- 24.820 speaker_01
27.638 -- 38.017 speaker_02
44.345 -- 45.357 speaker_03
45.290 -- 50.268 speaker_02
52.563 -- 54.605 speaker_01
Duration : 56.861 s
Elapsed seconds: 11.465 s
Real time factor (RTF): 11.465 / 56.861 = 0.202
NeMo + model.int8.onnx
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-speaker-diarization \
--clustering.num-clusters=4 \
--segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx \
--embedding.model=./nemo_en_titanet_small.onnx \
./0-four-speakers-zh.wav
The output is given below:
/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:375 ./build/bin/sherpa-onnx-offline-speaker-diarization --clustering.num-clusters=4 --segmentation.pyannote-model=./sherpa-onnx-reverb-diarization-v1/model.int8.onnx --embedding.model=./nemo_en_titanet_small.onnx ./0-four-speakers-zh.wav
OfflineSpeakerDiarizationConfig(segmentation=OfflineSpeakerSegmentationModelConfig(pyannote=OfflineSpeakerSegmentationPyannoteModelConfig(model="./sherpa-onnx-reverb-diarization-v1/model.int8.onnx"), num_threads=1, debug=False, provider="cpu"), embedding=SpeakerEmbeddingExtractorConfig(model="./nemo_en_titanet_small.onnx", num_threads=1, debug=False, provider="cpu"), clustering=FastClusteringConfig(num_clusters=4, threshold=0.5), min_duration_on=0.3, min_duration_off=0.5)
Started
0.031 -- 6.815 speaker_01
7.017 -- 13.666 speaker_00
13.784 -- 16.973 speaker_02
21.023 -- 24.854 speaker_01
27.655 -- 38.877 speaker_02
38.168 -- 45.914 speaker_03
45.526 -- 50.352 speaker_02
52.580 -- 54.622 speaker_01
Duration : 56.861 s
Elapsed seconds: 9.688 s
Real time factor (RTF): 9.688 / 56.861 = 0.170