PocketTTS

This page explains how to use sherpa-onnx with PocketTTS.

PocketTTS is an offline zero-shot text-to-speech model. It uses a short reference audio clip to clone the target voice.

Unlike ZipVoice, PocketTTS does not require a reference transcript. You only need --reference-audio.

Download a pre-trained model

Download the released PocketTTS archive from https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
tar xf sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
rm sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
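
Before running anything, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch, not part of sherpa-onnx: check_pocket_model is a hypothetical helper name, and the file list is simply the set of paths used by the command-line example on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with sherpa-onnx): report which of the
# files referenced by the command-line example are missing from a model dir.
check_pocket_model() {
  local dir="$1"
  local missing=0
  local f
  for f in lm_flow.int8.onnx lm_main.int8.onnx encoder.onnx decoder.int8.onnx \
           text_conditioner.onnx vocab.json token_scores.json test_wavs/bria.wav; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when every file is present, 1 otherwise.
  return "$missing"
}
```

For example, check_pocket_model ./sherpa-onnx-pocket-tts-int8-2026-01-26 && echo "all files present" prints the message only if none of the listed files is missing.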

Run a command-line example

The following command uses the same model files as rust-api-examples/examples/pocket_tts.rs:

./build/bin/sherpa-onnx-offline-tts \
  --pocket-lm-flow=./sherpa-onnx-pocket-tts-int8-2026-01-26/lm_flow.int8.onnx \
  --pocket-lm-main=./sherpa-onnx-pocket-tts-int8-2026-01-26/lm_main.int8.onnx \
  --pocket-encoder=./sherpa-onnx-pocket-tts-int8-2026-01-26/encoder.onnx \
  --pocket-decoder=./sherpa-onnx-pocket-tts-int8-2026-01-26/decoder.int8.onnx \
  --pocket-text-conditioner=./sherpa-onnx-pocket-tts-int8-2026-01-26/text_conditioner.onnx \
  --pocket-vocab-json=./sherpa-onnx-pocket-tts-int8-2026-01-26/vocab.json \
  --pocket-token-scores-json=./sherpa-onnx-pocket-tts-int8-2026-01-26/token_scores.json \
  --reference-audio=./sherpa-onnx-pocket-tts-int8-2026-01-26/test_wavs/bria.wav \
  --num-steps=2 \
  --output-filename=./pocket.wav \
  "Today as always, men fall into two groups: slaves and free men."

You can also use this tracked helper script:

API examples

Additional example code is available here:

Notes

  • PocketTTS needs a reference audio clip.

  • PocketTTS does not require reference text. This is different from ZipVoice.

  • The reference audio should contain the voice that you want to clone.

  • --num-steps controls the trade-off between generation quality and speed; more steps take longer to run.
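
The notes above can be tried out with a small wrapper around the command shown earlier. This is a sketch under stated assumptions: synthesize_with_steps, MODEL_DIR, TTS_BIN and the pocket-steps-N.wav output names are hypothetical and exist only for illustration; the flags themselves are the ones used in the command-line example on this page.

```shell
#!/usr/bin/env bash
# Sketch: run the PocketTTS example with a chosen --num-steps value and,
# optionally, a different reference clip. MODEL_DIR and TTS_BIN are
# illustrative variables, not options of sherpa-onnx itself.
MODEL_DIR="${MODEL_DIR:-./sherpa-onnx-pocket-tts-int8-2026-01-26}"
TTS_BIN="${TTS_BIN:-./build/bin/sherpa-onnx-offline-tts}"

# Usage: synthesize_with_steps STEPS [REFERENCE_WAV]
synthesize_with_steps() {
  local steps="$1"
  # Fall back to the reference clip bundled with the model archive.
  local ref="${2:-$MODEL_DIR/test_wavs/bria.wav}"
  "$TTS_BIN" \
    --pocket-lm-flow="$MODEL_DIR/lm_flow.int8.onnx" \
    --pocket-lm-main="$MODEL_DIR/lm_main.int8.onnx" \
    --pocket-encoder="$MODEL_DIR/encoder.onnx" \
    --pocket-decoder="$MODEL_DIR/decoder.int8.onnx" \
    --pocket-text-conditioner="$MODEL_DIR/text_conditioner.onnx" \
    --pocket-vocab-json="$MODEL_DIR/vocab.json" \
    --pocket-token-scores-json="$MODEL_DIR/token_scores.json" \
    --reference-audio="$ref" \
    --num-steps="$steps" \
    --output-filename="./pocket-steps-${steps}.wav" \
    "Today as always, men fall into two groups: slaves and free men."
}
```

With the function defined, a loop such as for s in 1 2 4; do synthesize_with_steps "$s"; done writes one file per setting so the results can be compared by ear. The step values here are arbitrary; passing a second argument swaps in your own reference clip.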