ZipVoice
This page explains how to use sherpa-onnx with ZipVoice.
ZipVoice is an offline zero-shot voice-cloning model. It uses both a reference audio clip and the matching reference text.
Unlike PocketTTS, ZipVoice requires both --reference-audio and --reference-text.
Download a pre-trained model
Download the released ZipVoice archive from https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
tar xf sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
rm sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
You also need to download the vocoder model vocos_24khz.onnx:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/vocos_24khz.onnx
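Before running anything, it can help to confirm that the downloads unpacked as expected. The following is a minimal sketch, assuming the archive contains the files that the command-line example on this page passes to sherpa-onnx-offline-tts. MODEL_DIR, VOCODER, and check_zipvoice_files are illustrative names chosen for this sketch, not sherpa-onnx settings.

```shell
#!/bin/sh
# Sketch: confirm that the files used by the command-line example are present.
# MODEL_DIR and VOCODER are illustrative variable names (assumptions),
# defaulting to the paths produced by the download steps on this page.
MODEL_DIR="${MODEL_DIR:-./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia}"
VOCODER="${VOCODER:-./vocos_24khz.onnx}"

# Print every missing path; return non-zero if at least one is absent.
check_zipvoice_files() {
  missing=0
  for path in \
    "$MODEL_DIR/encoder.int8.onnx" \
    "$MODEL_DIR/decoder.int8.onnx" \
    "$MODEL_DIR/espeak-ng-data" \
    "$MODEL_DIR/lexicon.txt" \
    "$MODEL_DIR/tokens.txt" \
    "$VOCODER"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  return "$missing"
}

if check_zipvoice_files; then
  echo "all ZipVoice files found"
fi
```

If any path is reported as missing, re-check the download and extraction steps before running the synthesis command.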
Run a command-line example
The following command uses the same model files as rust-api-examples/examples/zipvoice_tts.rs:
./build/bin/sherpa-onnx-offline-tts \
--zipvoice-encoder=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/encoder.int8.onnx \
--zipvoice-decoder=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/decoder.int8.onnx \
--zipvoice-data-dir=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/espeak-ng-data \
--zipvoice-lexicon=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/lexicon.txt \
--zipvoice-tokens=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/tokens.txt \
--zipvoice-vocoder=./vocos_24khz.onnx \
--reference-audio=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/test_wavs/leijun-1.wav \
--reference-text="那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系." \
--num-steps=4 \
--output-filename=./zipvoice.wav \
"小米的价值观是真诚, 热爱, 真诚,就是不欺人也不自欺."
You can also use this tracked helper script:
Important
--reference-text should be the exact transcript of --reference-audio.
If they do not match, the synthesized voice quality can degrade noticeably.
API examples
Additional example code is available here:
Rust
C++ and C
Python
Go
Java and Kotlin
Dart and Swift
.NET
JavaScript
Pascal
Notes
ZipVoice needs both --reference-audio and --reference-text.
--reference-text should exactly match what is spoken in --reference-audio.
The released example model is bilingual Chinese/English.
--num-steps controls the generation quality/speed tradeoff.
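One way to explore the --num-steps tradeoff is to synthesize the same sentence with several values and compare the results by listening and timing. The sketch below reuses the flags from the command-line example above. The step values 2, 4, and 8, the DRY_RUN switch, the MODEL_DIR variable, the run_zipvoice function, and the output file names are assumptions made for illustration; this page does not state which range of values is valid or recommended.

```shell
#!/bin/sh
# Sketch: run the command-line example with several --num-steps values.
# MODEL_DIR and DRY_RUN are illustrative variables, not sherpa-onnx options.
MODEL_DIR="${MODEL_DIR:-./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command; 0 = actually run it

run_zipvoice() {
  steps="$1"
  # Build the argument list in the function's positional parameters.
  set -- ./build/bin/sherpa-onnx-offline-tts \
    "--zipvoice-encoder=$MODEL_DIR/encoder.int8.onnx" \
    "--zipvoice-decoder=$MODEL_DIR/decoder.int8.onnx" \
    "--zipvoice-data-dir=$MODEL_DIR/espeak-ng-data" \
    "--zipvoice-lexicon=$MODEL_DIR/lexicon.txt" \
    "--zipvoice-tokens=$MODEL_DIR/tokens.txt" \
    "--zipvoice-vocoder=./vocos_24khz.onnx" \
    "--reference-audio=$MODEL_DIR/test_wavs/leijun-1.wav" \
    "--reference-text=那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系." \
    "--num-steps=$steps" \
    "--output-filename=./zipvoice-steps-$steps.wav" \
    "小米的价值观是真诚, 热爱, 真诚,就是不欺人也不自欺."
  if [ "$DRY_RUN" = "1" ]; then
    # Printed form is for inspection only; shell quoting is not preserved.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# The values below are arbitrary examples; 4 is the value used on this page.
for n in 2 4 8; do
  run_zipvoice "$n"
done
```

With DRY_RUN left at 1 the script only prints the three commands. Setting DRY_RUN=0 runs them and writes one WAV file per step count, so the outputs can be compared side by side.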