Speaker Diarization

Determine who speaks when in an audio file. This example combines a pyannote segmentation model, a speaker embedding model, and clustering to label each speech segment with a speaker.

Source file

nodejs-addon-examples/test_offline_speaker_diarization.js

Code

// Copyright (c)  2024  Xiaomi Corporation
//
// Offline speaker diarization: determine who speaks when in an audio file.
//
// Usage:
//   node speaker_diarization.js
//
const sherpa_onnx = require('sherpa-onnx-node');

// Model files required:
// 1. Segmentation model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
//
// 2. Embedding model:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx
//
// 3. Test wave file:
//    https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav

const config = {
  segmentation: {
    pyannote: {
      model: './sherpa-onnx-pyannote-segmentation-3-0/model.onnx',
    },
  },
  embedding: {
    model: './3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
  },
  clustering: {
    // Set numClusters to the expected number of speakers, or -1 to
    // let the algorithm decide automatically using the threshold.
    numClusters: 4,
    // A larger threshold leads to fewer clusters (fewer speakers).
    // Ignored when numClusters is not -1.
    threshold: 0.5,
  },
  // Discard segments shorter than minDurationOn seconds.
  minDurationOn: 0.2,
  // Merge two segments if the gap between them is less than minDurationOff.
  minDurationOff: 0.5,
};

const waveFilename = './0-four-speakers-zh.wav';

const sd = new sherpa_onnx.OfflineSpeakerDiarization(config);

const wave = sherpa_onnx.readWave(waveFilename);
if (sd.sampleRate !== wave.sampleRate) {
  throw new Error(
      `Expected sample rate: ${sd.sampleRate}, given: ${wave.sampleRate}`);
}

const segments = sd.process(wave.samples);

// Each segment has: {start, end, speaker}
console.log('Segments:');
for (const seg of segments) {
  console.log(
      `  Speaker ${seg.speaker}: ${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s`);
}
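
When the number of speakers is not known in advance, the clustering comments in the config above suggest setting numClusters to -1 and relying on threshold. The following is a minimal sketch of a helper that builds the clustering block either way. The name makeClusteringConfig is hypothetical and not part of sherpa-onnx-node, and the default threshold of 0.5 is simply the value used in this example.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node): builds the
// `clustering` block of the config shown above.
function makeClusteringConfig(numSpeakers, threshold = 0.5) {
  const known = Number.isInteger(numSpeakers) && numSpeakers > 0;
  return {
    // -1 asks the clustering step to choose the number of speakers itself.
    numClusters: known ? numSpeakers : -1,
    // Only used when numClusters is -1; a larger value gives fewer speakers.
    threshold,
  };
}

console.log(makeClusteringConfig(4));  // speaker count known
console.log(makeClusteringConfig());   // speaker count unknown
```

The returned object can be assigned to the clustering field of the config before constructing OfflineSpeakerDiarization.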

How to run

Install the package:
```
npm install sherpa-onnx-node
```

Download the models and test file:

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
tar xvf sherpa-onnx-pyannote-segmentation-3-0.tar.bz2
rm sherpa-onnx-pyannote-segmentation-3-0.tar.bz2

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-recongition-models/3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx

curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/speaker-segmentation-models/0-four-speakers-zh.wav

Save the code above as speaker_diarization.js, set the library path for your platform, and run it:

# macOS
export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH

# Linux
export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH

node speaker_diarization.js

Expected output

Segments:
  Speaker 0: 0.20s - 5.40s
  Speaker 1: 5.80s - 12.30s
  Speaker 2: 12.60s - 18.90s
  Speaker 0: 19.20s - 25.10s
  Speaker 3: 25.50s - 31.20s
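
Because the same speaker label can appear in several segments (Speaker 0 appears twice above), a common next step is to total the speaking time per speaker. The sketch below does this on a plain array shaped like the result of sd.process(); the sample values are copied from the output above, and totalSpeakingTime is a hypothetical helper name, not a library function.

```javascript
// Sample data copied from the expected output above.
const sampleSegments = [
  {start: 0.20, end: 5.40, speaker: 0},
  {start: 5.80, end: 12.30, speaker: 1},
  {start: 12.60, end: 18.90, speaker: 2},
  {start: 19.20, end: 25.10, speaker: 0},
  {start: 25.50, end: 31.20, speaker: 3},
];

// Sum the duration of all segments for each speaker label.
function totalSpeakingTime(segs) {
  const totals = new Map();
  for (const {start, end, speaker} of segs) {
    totals.set(speaker, (totals.get(speaker) || 0) + (end - start));
  }
  return totals;
}

for (const [speaker, seconds] of totalSpeakingTime(sampleSegments)) {
  console.log(`Speaker ${speaker}: ${seconds.toFixed(2)}s`);
}
```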

Notes

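The config has three parts: segmentation (finds where speech occurs in the audio), embedding (computes speaker vectors for the detected speech), and clustering (groups those vectors so that each group corresponds to one speaker).
Set clustering.numClusters to the number of speakers if you know it. Set it to -1 to let the clustering step choose the number of speakers based on clustering.threshold; a larger threshold yields fewer speakers. The threshold is ignored when numClusters is not -1.
minDurationOn discards segments shorter than the given duration, in seconds.
minDurationOff merges two segments when the gap between them is shorter than the given duration, in seconds.
Each returned segment has start, end, and speaker fields. start and end are times in seconds, and speaker is an integer label as shown in the output above.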
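
To make the effect of minDurationOn and minDurationOff concrete, here is a minimal sketch that applies the same two rules to a plain array of segments. It only illustrates the rules as described in these notes and is not the library's implementation. Two things are assumptions: that merging applies to consecutive segments of the same speaker, and that merging happens before short segments are discarded. mergeAndFilter is a hypothetical name.

```javascript
// Illustration only: merge close same-speaker segments, then drop short ones.
function mergeAndFilter(segs, minDurationOn, minDurationOff) {
  // Group copies of the segments by speaker, in order of start time.
  const bySpeaker = new Map();
  for (const seg of [...segs].sort((a, b) => a.start - b.start)) {
    if (!bySpeaker.has(seg.speaker)) bySpeaker.set(seg.speaker, []);
    bySpeaker.get(seg.speaker).push({...seg});
  }

  const result = [];
  for (const group of bySpeaker.values()) {
    const merged = [];
    for (const seg of group) {
      const last = merged[merged.length - 1];
      if (last && seg.start - last.end < minDurationOff) {
        // Gap is shorter than minDurationOff: extend the previous segment.
        last.end = Math.max(last.end, seg.end);
      } else {
        merged.push(seg);
      }
    }
    // Keep only segments that last at least minDurationOn seconds.
    for (const seg of merged) {
      if (seg.end - seg.start >= minDurationOn) result.push(seg);
    }
  }
  return result.sort((a, b) => a.start - b.start);
}

const raw = [
  {start: 0.0, end: 1.0, speaker: 0},
  {start: 1.3, end: 2.0, speaker: 0},  // 0.3 s gap: merged with the previous
  {start: 2.2, end: 2.3, speaker: 1},  // 0.1 s long: discarded
  {start: 3.0, end: 4.0, speaker: 1},  // 0.7 s gap: kept as its own segment
];
console.log(mergeAndFilter(raw, 0.2, 0.5));
```

With the values from this example (0.2 and 0.5), the four raw segments reduce to two: one for speaker 0 covering 0 to 2 seconds, and one for speaker 1 covering 3 to 4 seconds.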