ZipVoice

This page explains how to use sherpa-onnx with ZipVoice.

ZipVoice is an offline zero-shot voice-cloning model: it clones a voice from a short reference audio clip together with that clip's exact transcript.

Unlike PocketTTS, ZipVoice requires both --reference-audio and --reference-text.

Download a pre-trained model

Download the released ZipVoice archive from https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
tar xf sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2
rm sherpa-onnx-zipvoice-distill-int8-zh-en-emilia.tar.bz2

You also need to download the vocoder model vocos_24khz.onnx:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/vocos_24khz.onnx
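Before running synthesis, a quick sanity check can confirm that the extracted archive and the vocoder are where the command below expects them. This is only a sketch; the file list simply mirrors the paths used in the example (the espeak-ng-data directory is omitted because the loop tests regular files):

```shell
# Print a line for each required model file that is missing.
missing=""
for f in \
  ./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/encoder.int8.onnx \
  ./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/decoder.int8.onnx \
  ./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/lexicon.txt \
  ./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/tokens.txt \
  ./vocos_24khz.onnx; do
  [ -f "$f" ] || { echo "missing: $f"; missing="yes"; }
done
[ -z "$missing" ] && echo "all model files present"
```

If any line starting with "missing:" is printed, re-run the download steps above before continuing.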

Run a command-line example

The following command uses the same model files as rust-api-examples/examples/zipvoice_tts.rs:

./build/bin/sherpa-onnx-offline-tts \
  --zipvoice-encoder=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/encoder.int8.onnx \
  --zipvoice-decoder=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/decoder.int8.onnx \
  --zipvoice-data-dir=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/espeak-ng-data \
  --zipvoice-lexicon=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/lexicon.txt \
  --zipvoice-tokens=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/tokens.txt \
  --zipvoice-vocoder=./vocos_24khz.onnx \
  --reference-audio=./sherpa-onnx-zipvoice-distill-int8-zh-en-emilia/test_wavs/leijun-1.wav \
  --reference-text="那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系." \
  --num-steps=4 \
  --output-filename=./zipvoice.wav \
  "小米的价值观是真诚, 热爱, 真诚,就是不欺人也不自欺."

Important

--reference-text should be the exact transcript of --reference-audio. If they do not match, the synthesized voice quality can degrade noticeably.
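Long transcripts are easy to mangle with shell quoting. One optional pattern (a sketch; the file name ref_text.txt is a hypothetical choice, not something sherpa-onnx requires) is to keep the transcript in a UTF-8 text file and splice it in with command substitution:

```shell
# Store the exact transcript of the reference audio in a UTF-8 file
# (ref_text.txt is a hypothetical name used only for this sketch).
printf '%s' "那还是三十六年前, 一九八七年. 我呢考上了武汉大学的计算机系." > ref_text.txt

# Read it back; the double quotes preserve spaces and punctuation.
REF_TEXT="$(cat ref_text.txt)"
echo "--reference-text=${REF_TEXT}"
```

The value can then be passed as --reference-text="${REF_TEXT}" in the command above, keeping the transcript and the audio clip paired in one place.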

API examples

Additional example code, such as rust-api-examples/examples/zipvoice_tts.rs, is available in the sherpa-onnx repository.

Notes

  • ZipVoice needs both --reference-audio and --reference-text.

  • --reference-text should exactly match what is spoken in --reference-audio.

  • The released example model is bilingual Chinese/English.

  • --num-steps controls the quality/speed tradeoff: more steps generally improve audio quality at the cost of slower synthesis.
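The quality/speed tradeoff can be explored with a small sweep over step counts. This is a sketch: the step values and output names are illustrative, and the real invocation is left as a comment so it can be filled in with the model and reference flags from the example above:

```shell
# Generate one output file per step count; more steps generally sound
# better but take longer to synthesize.
for steps in 2 4 8 16; do
  out="zipvoice-${steps}.wav"
  echo "num-steps=${steps} -> ${out}"
  # Real run (use the same model/vocoder/reference flags as above):
  # ./build/bin/sherpa-onnx-offline-tts ... \
  #   --num-steps=${steps} --output-filename=./${out} "..."
done
```

Listening to the resulting files side by side makes it easy to pick the smallest step count that still sounds acceptable for your use case.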