TTS: ZipVoice (Voice Cloning)
Generate speech with the ZipVoice model using voice cloning. ZipVoice uses a reference audio clip and its transcript to clone the speaker’s voice.
Source files
Synchronous generation
// Copyright (c) 2026 Xiaomi Corporation
//
// Text-to-speech with the ZipVoice model (voice cloning).
// Uses a reference audio and reference text to clone the speaker's voice.
//
// Usage:
//   node tts_zipvoice_sync.js
//
const sherpa_onnx = require('sherpa-onnx-node');

function createOfflineTts() {
  const config = {
    model: {
      zipvoice: {
        tokens: './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/tokens.txt',
        encoder:
            './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/encoder.int8.onnx',
        decoder:
            './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/decoder.int8.onnx',
        vocoder: './vocos_24khz.onnx',
        dataDir:
            './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/espeak-ng-data',
        lexicon: './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/lexicon.txt',
      },
      debug: true,
      numThreads: 2,
      provider: 'cpu',
    },
    maxNumSentences: 1,
  };
  return new sherpa_onnx.OfflineTts(config);
}

const tts = createOfflineTts();

const text =
    '小米的价值观是真诚, 热爱. 真诚,就是不欺人也不自欺. 热爱, 就是全心投入并享受其中.';

// ZipVoice requires a reference audio and its transcript for voice cloning.
const referenceText =
    '那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系.';
const referenceAudioFilename =
    './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/test_wavs/leijun-1.wav';
const referenceWave = sherpa_onnx.readWave(referenceAudioFilename);

const generationConfig = new sherpa_onnx.GenerationConfig({
  speed: 1.0,
  referenceAudio: referenceWave.samples,
  referenceSampleRate: referenceWave.sampleRate,
  referenceText,
  numSteps: 4,
  extra: {min_char_in_sentence: 10},
});

let start = Date.now();
const audio = tts.generate({text, generationConfig});
let stop = Date.now();
const elapsed_seconds = (stop - start) / 1000;
const duration = audio.samples.length / audio.sampleRate;
const real_time_factor = elapsed_seconds / duration;
console.log('Wave duration', duration.toFixed(3), 'seconds');
console.log('Elapsed', elapsed_seconds.toFixed(3), 'seconds');
console.log(
    `RTF = ${elapsed_seconds.toFixed(3)}/${duration.toFixed(3)} =`,
    real_time_factor.toFixed(3));

const filename = 'test-zipvoice-zh-en.wav';
sherpa_onnx.writeWave(
    filename, {samples: audio.samples, sampleRate: audio.sampleRate});

console.log(`Saved to ${filename}`);
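The script reports a real-time factor (RTF): the time spent generating divided by the duration of the generated audio. An RTF below 1 means the audio was generated faster than it takes to play back. The calculation can be isolated as a small function. This is a minimal sketch; realTimeFactor is a hypothetical helper name and is not part of sherpa-onnx-node.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node).
// RTF = generation time / duration of the generated audio.
function realTimeFactor(elapsedMs, numSamples, sampleRate) {
  if (numSamples <= 0 || sampleRate <= 0) {
    throw new RangeError('numSamples and sampleRate must be positive');
  }
  const elapsedSeconds = elapsedMs / 1000;
  const durationSeconds = numSamples / sampleRate;
  return elapsedSeconds / durationSeconds;
}

// Illustrative numbers only: 1200 ms of compute for 115200 samples
// at 24000 Hz, which is 4.8 seconds of audio.
const rtf = realTimeFactor(1200, 115200, 24000);
console.log('RTF =', rtf.toFixed(3));
```

With the script above, the same function could be called as realTimeFactor(stop - start, audio.samples.length, audio.sampleRate).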
How to run
Install the packages:
npm install sherpa-onnx-node
npm install speaker  # only needed for play_async
Download the model and vocoder:
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
tar xf sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
rm sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/vocos_24khz.onnx
Set the library path and run:
# macOS
export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH

# Linux
export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH

node tts_zipvoice_sync.js
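If a download or extraction step was skipped, it may help to confirm that the model files are in place before running the example. The following is a minimal sketch using only Node.js built-ins. The file list mirrors the paths used in tts_zipvoice_sync.js; findMissingFiles is a hypothetical helper name, not part of sherpa-onnx-node.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the subset of paths that do not exist.
function findMissingFiles(paths) {
  return paths.filter((p) => !fs.existsSync(p));
}

// Paths taken from the config in tts_zipvoice_sync.js.
const modelDir = './sherpa-onnx-zipvoice-distill-int8-zh-en-emilia';
const requiredPaths = [
  path.join(modelDir, 'tokens.txt'),
  path.join(modelDir, 'encoder.int8.onnx'),
  path.join(modelDir, 'decoder.int8.onnx'),
  path.join(modelDir, 'espeak-ng-data'),
  path.join(modelDir, 'lexicon.txt'),
  path.join(modelDir, 'test_wavs', 'leijun-1.wav'),
  './vocos_24khz.onnx',
];

const missing = findMissingFiles(requiredPaths);
if (missing.length > 0) {
  console.log('Missing files:\n' + missing.join('\n'));
} else {
  console.log('All model files found.');
}
```

Run it from the same directory as tts_zipvoice_sync.js, since all paths are relative to the current working directory.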
Notes
ZipVoice requires a reference audio AND its transcript for voice cloning. The GenerationConfig must include:
- referenceAudio: Float32Array of the reference audio samples.
- referenceSampleRate: Sample rate of the reference audio.
- referenceText: Transcript of the reference audio.
- numSteps: Number of diffusion steps (e.g., 4).
- extra.min_char_in_sentence: Minimum characters per sentence.

The config key is zipvoice with fields: tokens, encoder, decoder, vocoder, dataDir, lexicon.

ZipVoice also supports async generation. See the async example and play async example.
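Because a missing reference field is easy to overlook, the options object can be checked before it is passed to GenerationConfig. The sketch below is plain JavaScript and does not call sherpa-onnx-node. checkZipVoiceOptions is a hypothetical helper, and the specific rules (positive integers, non-empty transcript) are assumptions based on the field descriptions above, not checks the library is known to perform.

```javascript
// Hypothetical helper (not part of sherpa-onnx-node).
// Returns a list of human-readable problems; an empty list means the
// options contain every field listed in the notes above.
function checkZipVoiceOptions(options) {
  const problems = [];
  const audio = options.referenceAudio;
  if (!(audio instanceof Float32Array) || audio.length === 0) {
    problems.push('referenceAudio must be a non-empty Float32Array');
  }
  // Assumption: sample rates are positive integers, as in WAV headers.
  const rate = options.referenceSampleRate;
  if (!Number.isInteger(rate) || rate <= 0) {
    problems.push('referenceSampleRate must be a positive integer');
  }
  const text = options.referenceText;
  if (typeof text !== 'string' || text.trim() === '') {
    problems.push('referenceText must be a non-empty string');
  }
  if (!Number.isInteger(options.numSteps) || options.numSteps < 1) {
    problems.push('numSteps must be a positive integer');
  }
  const minChars = options.extra && options.extra.min_char_in_sentence;
  if (!Number.isInteger(minChars) || minChars < 1) {
    problems.push('extra.min_char_in_sentence must be a positive integer');
  }
  return problems;
}

// Example: the transcript was left empty by mistake.
const problems = checkZipVoiceOptions({
  speed: 1.0,
  referenceAudio: new Float32Array(16000),  // placeholder samples
  referenceSampleRate: 16000,
  referenceText: '',
  numSteps: 4,
  extra: {min_char_in_sentence: 10},
});
console.log(problems);
```

In the synchronous example, the same object literal that is passed to new sherpa_onnx.GenerationConfig(...) could be run through this check first, and generation skipped if the returned list is not empty.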