TTS: ZipVoice (Voice Cloning)

Generate speech with the ZipVoice model using voice cloning. ZipVoice uses a reference audio clip and its transcript to clone the speaker’s voice.

Source files

Synchronous generation

    // Copyright (c)  2026  Xiaomi Corporation
    //
    // Text-to-speech with the ZipVoice model (voice cloning).
    // Uses a reference audio and reference text to clone the speaker's voice.
    //
    // Usage:
    //   node tts_zipvoice_sync.js
    //
    const sherpa_onnx = require('sherpa-onnx-node');

    function createOfflineTts() {
      const config = {
        model: {
          zipvoice: {
            tokens: './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/tokens.txt',
            encoder:
                './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/encoder.int8.onnx',
            decoder:
                './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/decoder.int8.onnx',
            vocoder: './vocos_24khz.onnx',
            dataDir:
                './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/espeak-ng-data',
            lexicon: './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/lexicon.txt',
          },
          debug: true,
          numThreads: 2,
          provider: 'cpu',
        },
        maxNumSentences: 1,
      };
      return new sherpa_onnx.OfflineTts(config);
    }

    const tts = createOfflineTts();

    const text =
        '小米的价值观是真诚, 热爱. 真诚,就是不欺人也不自欺. 热爱, 就是全心投入并享受其中.';

    // ZipVoice requires a reference audio and its transcript for voice cloning.
    const referenceText =
        '那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系.';
    const referenceAudioFilename =
        './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/test_wavs/leijun-1.wav';
    const referenceWave = sherpa_onnx.readWave(referenceAudioFilename);

    const generationConfig = new sherpa_onnx.GenerationConfig({
      speed: 1.0,
      referenceAudio: referenceWave.samples,
      referenceSampleRate: referenceWave.sampleRate,
      referenceText,
      numSteps: 4,
      extra: {min_char_in_sentence: 10},
    });

    // Time the synthesis so the real-time factor (RTF) can be reported.
    const start = Date.now();
    const audio = tts.generate({text, generationConfig});
    const stop = Date.now();
    const elapsed_seconds = (stop - start) / 1000;
    const duration = audio.samples.length / audio.sampleRate;
    const real_time_factor = elapsed_seconds / duration;
    console.log('Wave duration', duration.toFixed(3), 'seconds');
    console.log('Elapsed', elapsed_seconds.toFixed(3), 'seconds');
    console.log(
        `RTF = ${elapsed_seconds.toFixed(3)}/${duration.toFixed(3)} =`,
        real_time_factor.toFixed(3));

    const filename = 'test-zipvoice-zh-en.wav';
    sherpa_onnx.writeWave(
        filename, {samples: audio.samples, sampleRate: audio.sampleRate});

    console.log(`Saved to ${filename}`);
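
The script reports a real-time factor (RTF): elapsed processing time divided by the duration of the generated audio, so a value below 1 means synthesis ran faster than playback. The same arithmetic can be isolated as a small function. This is only a sketch; realTimeFactor is a hypothetical helper, not part of sherpa-onnx-node, and the numbers in the usage line are made up for illustration.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node).
// RTF = processing time in seconds / audio duration in seconds.
function realTimeFactor(elapsedMs, numSamples, sampleRate) {
  const elapsedSeconds = elapsedMs / 1000;
  const durationSeconds = numSamples / sampleRate;
  return elapsedSeconds / durationSeconds;
}

// Illustrative numbers only: 1.2 s of compute for 115200 samples at 24 kHz
// (4.8 s of audio).
console.log('RTF =', realTimeFactor(1200, 115200, 24000).toFixed(3));
```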

How to run

  1. Install the packages:

    npm install sherpa-onnx-node
    npm install speaker  # only needed for play_async
    
  2. Download the model and vocoder:

    curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
    tar xf sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
    rm sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
    
    curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/vocos_24khz.onnx
    
  3. Set the library path and run:

    # macOS
    export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH
    
    # Linux
    export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH
    
    node tts_zipvoice_sync.js
    
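Before running the example, it can help to confirm that the download steps put every file where the script expects it. The following is a minimal sketch that uses only the Node.js standard library. The paths are copied from the example config above; findMissingPaths is a hypothetical helper name, not part of sherpa-onnx-node.

```javascript
const fs = require('fs');
const path = require('path');

// Paths copied from the example config; adjust them if you unpacked elsewhere.
const modelDir = './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';
const requiredPaths = [
  path.join(modelDir, 'tokens.txt'),
  path.join(modelDir, 'encoder.int8.onnx'),
  path.join(modelDir, 'decoder.int8.onnx'),
  path.join(modelDir, 'espeak-ng-data'),
  path.join(modelDir, 'lexicon.txt'),
  path.join(modelDir, 'test_wavs', 'leijun-1.wav'),
  './vocos_24khz.onnx',
];

// Hypothetical helper: returns the entries of `paths` that do not exist.
function findMissingPaths(paths) {
  return paths.filter((p) => !fs.existsSync(p));
}

const missing = findMissingPaths(requiredPaths);
if (missing.length === 0) {
  console.log('All model files found.');
} else {
  console.log('Missing files:\n  ' + missing.join('\n  '));
}
```

Run it from the same directory as tts_zipvoice_sync.js, since all paths are relative.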

Notes

  • ZipVoice requires both a reference audio clip and its transcript for voice cloning. The GenerationConfig must include:
      - referenceAudio: Float32Array of the reference audio samples.
      - referenceSampleRate: Sample rate of the reference audio.
      - referenceText: Transcript of the reference audio.
      - numSteps: Number of diffusion steps (e.g., 4).
      - extra.min_char_in_sentence: Minimum characters per sentence.

  • The config key is zipvoice with fields: tokens, encoder, decoder, vocoder, dataDir, lexicon.

  • ZipVoice also supports async generation. See the async example and play async example.
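
The voice-cloning fields listed in the first note can be checked before they are passed to GenerationConfig, which makes a missing transcript or an empty reference clip easier to spot. This is a sketch under assumptions: checkZipVoiceOptions is a hypothetical helper, not part of sherpa-onnx-node, and the specific rules (non-empty array, positive integer values) are reasonable guesses rather than constraints documented by the library.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node): returns a list of
// problems found in the options intended for GenerationConfig.
// The rules below are assumptions, not documented library constraints.
function checkZipVoiceOptions(options) {
  const problems = [];
  const audio = options.referenceAudio;
  if (!(audio instanceof Float32Array) || audio.length === 0) {
    problems.push('referenceAudio must be a non-empty Float32Array');
  }
  const rate = options.referenceSampleRate;
  if (!Number.isInteger(rate) || rate <= 0) {
    problems.push('referenceSampleRate must be a positive integer');
  }
  const text = options.referenceText;
  if (typeof text !== 'string' || text.trim() === '') {
    problems.push('referenceText must be a non-empty string');
  }
  if (!Number.isInteger(options.numSteps) || options.numSteps <= 0) {
    problems.push('numSteps must be a positive integer');
  }
  return problems;
}

// Usage: the transcript is empty here, so one problem is reported.
const problems = checkZipVoiceOptions({
  referenceAudio: new Float32Array(16000),
  referenceSampleRate: 16000,
  referenceText: '',
  numSteps: 4,
});
console.log(problems);
```

In the synchronous example, the same object literal that is passed to new sherpa_onnx.GenerationConfig(...) could be run through this check first.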