Speaker Diarization

Determine who speaks when in an audio file. This example uses pyannote segmentation, speaker embeddings, and clustering to identify and separate speakers.

Source file

nodejs-addon-examples/test_offline_speaker_diarization.js

Code

// Copyright (c)  2024  Xiaomi Corporation
//
// Offline speaker diarization: determine who speaks when in an audio file.
//
// Usage:
//   node speaker_diarization.js
//
const sherpa_onnx = require('sherpa-onnx-node');

// Model files required:
// 1. Segmentation model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
//
// 2. Embedding model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
//
// 3. Test wave file:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav

const config = {
  segmentation: {
    pyannote: {
      model: './sherpa-onnx-pyannote-segmentation-3-0/model.onnx',
    },
  },
  embedding: {
    model: './3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
  },
  clustering: {
    // Set numClusters to the expected number of speakers, or -1 to
    // let the algorithm decide automatically using the threshold.
    numClusters: 4,
    // A larger threshold leads to fewer clusters (fewer speakers).
    // Ignored when numClusters is not -1.
    threshold: 0.5,
  },
  // Discard segments shorter than minDurationOn seconds.
  minDurationOn: 0.2,
  // Merge two segments if the gap between them is less than minDurationOff.
  minDurationOff: 0.5,
};

const waveFilename = './0-four-speakers-zh.wav';

const sd = new sherpa_onnx.OfflineSpeakerDiarization(config);

const wave = sherpa_onnx.readWave(waveFilename);
if (sd.sampleRate != wave.sampleRate) {
  throw new Error(
      `Expected sample rate: ${sd.sampleRate}, given: ${wave.sampleRate}`);
}

const segments = sd.process(wave.samples);

// Each segment has: {start, end, speaker}
console.log('Segments:');
for (const seg of segments) {
  console.log(
      `  Speaker ${seg.speaker}: ${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s`);
}
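
When the number of speakers is not known in advance, the comments in the clustering section say to set numClusters to -1 and let the threshold decide. The following is a minimal sketch of a helper that builds that section either way. The name makeClusteringConfig is hypothetical and not part of sherpa-onnx-node, and the default threshold of 0.5 is simply the value used in the listing above.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node): builds the
// `clustering` section of the diarization config.
function makeClusteringConfig(numSpeakers, threshold = 0.5) {
  // A known, positive speaker count is passed through as numClusters.
  // Anything else falls back to -1, so the threshold decides.
  const known = Number.isInteger(numSpeakers) && numSpeakers > 0;
  return {
    numClusters: known ? numSpeakers : -1,
    threshold,
  };
}

// Known speaker count: the threshold is kept but ignored by the clustering.
console.log(makeClusteringConfig(4));

// Unknown speaker count: a larger threshold should yield fewer speakers.
console.log(makeClusteringConfig(undefined, 0.6));
```

The returned object can be assigned to config.clustering before constructing OfflineSpeakerDiarization.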

How to run

  1. Install the package:

    npm install sherpa-onnx-node
    
  2. Download the models and test file:

    curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
    tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
    rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
    
    curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
    
    curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav
    
  3. Save the code above as speaker_diarization.js, set the library path, and run:

    # macOS
    export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH
    
    # Linux
    export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH
    
    node speaker_diarization.js
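
If the run fails because a model or the wave file cannot be found, a quick check of the downloaded files can narrow the problem down. The sketch below assumes the files sit in the current directory under the same paths the config uses; findMissingFiles is a hypothetical helper name, and only Node's built-in fs module is used.

```javascript
const fs = require('fs');

// Paths as they appear in the config; adjust if you saved the files elsewhere.
const requiredFiles = [
  './sherpa-onnx-pyannote-segmentation-3-0/model.onnx',
  './3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
  './0-four-speakers-zh.wav',
];

// Hypothetical helper: returns the subset of paths that do not exist.
function findMissingFiles(paths) {
  return paths.filter((p) => !fs.existsSync(p));
}

const missing = findMissingFiles(requiredFiles);
if (missing.length > 0) {
  console.error('Missing files:\n  ' + missing.join('\n  '));
} else {
  console.log('All model and test files are present.');
}
```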
    

Expected output

Segments:
  Speaker 0: 0.20s - 5.40s
  Speaker 1: 5.80s - 12.30s
  Speaker 2: 12.60s - 18.90s
  Speaker 0: 19.20s - 25.10s
  Speaker 3: 25.50s - 31.20s
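
Because each segment carries start, end, and speaker, simple summaries are easy to derive. As a small sketch, the code below totals the speaking time per speaker, using the sample values from the output above as stand-in data. The function name totalSpeakingTime is hypothetical; in a real script you would pass the array returned by sd.process().

```javascript
// Sample data copied from the expected output above.
const sampleSegments = [
  {start: 0.20, end: 5.40, speaker: 0},
  {start: 5.80, end: 12.30, speaker: 1},
  {start: 12.60, end: 18.90, speaker: 2},
  {start: 19.20, end: 25.10, speaker: 0},
  {start: 25.50, end: 31.20, speaker: 3},
];

// Hypothetical helper: sums (end - start) for each speaker label.
function totalSpeakingTime(segs) {
  const totals = new Map();
  for (const {start, end, speaker} of segs) {
    totals.set(speaker, (totals.get(speaker) ?? 0) + (end - start));
  }
  return totals;
}

for (const [speaker, seconds] of totalSpeakingTime(sampleSegments)) {
  console.log(`Speaker ${speaker}: ${seconds.toFixed(2)}s`);
}
```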

Notes

  • The config has three parts: segmentation (detects speech/non-speech boundaries), embedding (computes speaker vectors), and clustering (groups segments by speaker).

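  • Set clustering.numClusters to the expected number of speakers if known, or -1 to let the algorithm decide automatically using clustering.threshold. A larger threshold leads to fewer speakers; the threshold is ignored when numClusters is not -1.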

  • minDurationOn discards segments shorter than the given number of seconds.

  • minDurationOff merges two segments if the gap between them is less than the given number of seconds.

  • Each returned segment has start, end, and speaker fields.
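
To make the two duration parameters concrete, here is a plain JavaScript sketch of the behavior the notes describe, applied to the segments of a single speaker. It is an illustration only, not the library's implementation: postProcess is a hypothetical name, and the order of operations (merge first, then discard) is an assumption.

```javascript
// Illustration only; this is not sherpa-onnx's internal code.
// `segs` must be sorted by start time and belong to one speaker.
function postProcess(segs, minDurationOn, minDurationOff) {
  const merged = [];
  for (const seg of segs) {
    const last = merged[merged.length - 1];
    if (last && seg.start - last.end < minDurationOff) {
      // Gap is shorter than minDurationOff: extend the previous segment.
      last.end = Math.max(last.end, seg.end);
    } else {
      merged.push({...seg});
    }
  }
  // Drop segments shorter than minDurationOn seconds.
  return merged.filter((s) => s.end - s.start >= minDurationOn);
}

const example = [
  {start: 0.0, end: 1.0},
  {start: 1.25, end: 2.0},   // gap of 0.25 s: merged with the previous one
  {start: 4.0, end: 4.125},  // 0.125 s long: discarded
  {start: 6.0, end: 7.0},
];
console.log(postProcess(example, 0.2, 0.5));
```

With minDurationOn set to 0.2 and minDurationOff set to 0.5, as in the config, the four input segments reduce to two: one spanning 0 to 2 seconds and one spanning 6 to 7 seconds.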