Speaker Diarization
Determine who speaks when in an audio file. This example uses pyannote segmentation, speaker embeddings, and clustering to identify and separate speakers.
Code
// Copyright (c) 2024 Xiaomi Corporation
//
// Offline speaker diarization: determine who speaks when in an audio file.
//
// Usage:
//   node speaker_diarization.js
//
const sherpa_onnx = require('sherpa-onnx-node');

// Model files required:
// 1. Segmentation model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
//
// 2. Embedding model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
//
// 3. Test wave file:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav

const config = {
  segmentation: {
    pyannote: {
      model: './sherpa-onnx-pyannote-segmentation-3-0/model.onnx',
    },
  },
  embedding: {
    model: './3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
  },
  clustering: {
    // Set numClusters to the expected number of speakers, or -1 to
    // let the algorithm decide automatically using the threshold.
    numClusters: 4,
    // A larger threshold leads to fewer clusters (fewer speakers).
    // Ignored when numClusters is not -1.
    threshold: 0.5,
  },
  // Discard segments shorter than minDurationOn seconds.
  minDurationOn: 0.2,
  // Merge two segments if the gap between them is less than minDurationOff.
  minDurationOff: 0.5,
};

const waveFilename = './0-four-speakers-zh.wav';

const sd = new sherpa_onnx.OfflineSpeakerDiarization(config);

// The input audio must match the sample rate the models expect.
const wave = sherpa_onnx.readWave(waveFilename);
if (sd.sampleRate !== wave.sampleRate) {
  throw new Error(
      `Expected sample rate: ${sd.sampleRate}, given: ${wave.sampleRate}`);
}

const segments = sd.process(wave.samples);

// Each segment has: {start, end, speaker}
console.log('Segments:');
for (const seg of segments) {
  console.log(
      ` Speaker ${seg.speaker}: ${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s`);
}
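The comments in the clustering block say that numClusters can be set to -1 so that the threshold decides how many speakers are found. A minimal sketch of that variant is shown here, assuming the same model paths as the script above. The name autoConfig is hypothetical, and the threshold of 0.5 is simply carried over from the example rather than a tuned recommendation.

```javascript
// Variant of the config for an unknown number of speakers.
// `autoConfig` is a hypothetical name used only for this illustration.
const autoConfig = {
  segmentation: {
    pyannote: {
      model: './sherpa-onnx-pyannote-segmentation-3-0/model.onnx',
    },
  },
  embedding: {
    model: './3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
  },
  clustering: {
    // -1 means: do not fix the speaker count; use the threshold instead.
    numClusters: -1,
    // Per the example's comments, a larger threshold yields fewer speakers.
    threshold: 0.5,
  },
  minDurationOn: 0.2,
  minDurationOff: 0.5,
};

console.log(JSON.stringify(autoConfig.clustering));
```

This object would be passed to the OfflineSpeakerDiarization constructor in place of config. If the detected speaker count looks too high or too low, the threshold is the value to adjust.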
How to run
Install the package:
npm install sherpa-onnx-node
Download the models and test file:
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
Set the library path and run:
# macOS
export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH

# Linux
export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH

node speaker_diarization.js
Expected output
Segments:
Speaker 0: 0.20s - 5.40s
Speaker 1: 5.80s - 12.30s
Speaker 2: 12.60s - 18.90s
Speaker 0: 19.20s - 25.10s
Speaker 3: 25.50s - 31.20s
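Because each segment carries start, end, and speaker, the output can be aggregated after diarization. The sketch below sums speaking time per speaker. The helper speakingTimeBySpeaker and the sample data are hypothetical and not part of sherpa-onnx-node; in the script above, the segments returned by sd.process(wave.samples) would be passed in instead.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node):
// sum the duration of all segments for each speaker.
function speakingTimeBySpeaker(segments) {
  const totals = new Map();
  for (const {start, end, speaker} of segments) {
    totals.set(speaker, (totals.get(speaker) ?? 0) + (end - start));
  }
  return totals;
}

// Made-up sample data in the same {start, end, speaker} shape.
const sampleSegments = [
  {start: 0.0, end: 5.0, speaker: 0},
  {start: 5.5, end: 12.0, speaker: 1},
  {start: 19.0, end: 25.0, speaker: 0},
];

for (const [speaker, seconds] of speakingTimeBySpeaker(sampleSegments)) {
  console.log(`Speaker ${speaker}: ${seconds.toFixed(2)}s total`);
}
// Prints:
// Speaker 0: 11.00s total
// Speaker 1: 6.50s total
```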
Notes
The config has three parts: segmentation (detects speech/non-speech boundaries), embedding (computes speaker vectors), and clustering (groups segments by speaker).
Set clustering.numClusters to the expected number of speakers if known, or -1 to let the algorithm decide automatically using the threshold.
minDurationOn discards segments shorter than the given number of seconds.
minDurationOff merges two segments if the gap between them is less than the given number of seconds.
Each returned segment has start, end, and speaker fields.
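To make the minDurationOn and minDurationOff notes concrete, the sketch below applies the two rules to plain {start, end, speaker} objects. It is an illustration of the described behavior, not the library's implementation. It assumes that merging only applies to segments of the same speaker and that merging happens before short segments are discarded; neither detail is stated in the example. The function name applyDurationRules is hypothetical.

```javascript
// Illustrative only: NOT how sherpa-onnx implements these options.
// Assumptions: merge only same-speaker segments, and merge before discarding.
function applyDurationRules(segments, minDurationOn, minDurationOff) {
  // Group copies of the segments by speaker.
  const bySpeaker = new Map();
  for (const seg of segments) {
    if (!bySpeaker.has(seg.speaker)) bySpeaker.set(seg.speaker, []);
    bySpeaker.get(seg.speaker).push({...seg});
  }

  const result = [];
  for (const list of bySpeaker.values()) {
    list.sort((a, b) => a.start - b.start);

    // minDurationOff: merge neighbours whose gap is below the limit.
    const merged = [];
    for (const seg of list) {
      const last = merged[merged.length - 1];
      if (last && seg.start - last.end < minDurationOff) {
        last.end = Math.max(last.end, seg.end);
      } else {
        merged.push(seg);
      }
    }

    // minDurationOn: drop segments shorter than the limit.
    for (const seg of merged) {
      if (seg.end - seg.start >= minDurationOn) result.push(seg);
    }
  }
  return result.sort((a, b) => a.start - b.start);
}

// Made-up input data for the illustration.
const demo = [
  {start: 0.0, end: 2.0, speaker: 0},
  {start: 2.25, end: 4.0, speaker: 0},   // gap 0.25 < 0.5: merged with previous
  {start: 5.0, end: 5.125, speaker: 1},  // 0.125 long < 0.2: discarded
  {start: 6.0, end: 8.0, speaker: 0},    // gap 2.0 >= 0.5: stays separate
];

console.log(applyDurationRules(demo, 0.2, 0.5));
```

With the values from the example config (0.2 and 0.5), the four input segments reduce to two: one from 0 to 4 seconds and one from 6 to 8 seconds, both for speaker 0.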