TTS: Pocket (Voice Cloning)
Generate speech with the Pocket TTS model using voice cloning. Pocket uses a reference audio clip to clone the speaker’s voice for the generated speech.
Code
// Copyright (c) 2026 Xiaomi Corporation
//
// Text-to-speech with the Pocket TTS model (voice cloning).
// Uses a reference audio to clone the speaker's voice.
//
// Usage:
//   node tts_pocket_sync.js
//
const sherpa_onnx = require('sherpa-onnx-node');

function createOfflineTts() {
  const config = {
    model: {
      pocket: {
        lmFlow: './sherpa-onnx-pocket-tts-int8-2026-01-26/lm_flow.int8.onnx',
        lmMain: './sherpa-onnx-pocket-tts-int8-2026-01-26/lm_main.int8.onnx',
        encoder: './sherpa-onnx-pocket-tts-int8-2026-01-26/encoder.onnx',
        decoder: './sherpa-onnx-pocket-tts-int8-2026-01-26/decoder.int8.onnx',
        textConditioner:
            './sherpa-onnx-pocket-tts-int8-2026-01-26/text_conditioner.onnx',
        vocabJson: './sherpa-onnx-pocket-tts-int8-2026-01-26/vocab.json',
        tokenScoresJson:
            './sherpa-onnx-pocket-tts-int8-2026-01-26/token_scores.json',
        voiceEmbeddingCacheCapacity: 50,
      },
      debug: true,
      numThreads: 2,
      provider: 'cpu',
    },
    maxNumSentences: 1,
  };
  return new sherpa_onnx.OfflineTts(config);
}

const tts = createOfflineTts();

const text =
    'Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar.';

// Pocket TTS uses reference audio for voice cloning.
const referenceAudioFilename =
    './sherpa-onnx-pocket-tts-int8-2026-01-26/test_wavs/bria.wav';
const referenceWave = sherpa_onnx.readWave(referenceAudioFilename);

const generationConfig = new sherpa_onnx.GenerationConfig({
  speed: 1.0,
  referenceAudio: referenceWave.samples,
  referenceSampleRate: referenceWave.sampleRate,
  numSteps: 5,
  extra: {max_reference_audio_len: 12, seed: 42}
});

let start = Date.now();
const audio = tts.generate({text, generationConfig});
let stop = Date.now();
const elapsed_seconds = (stop - start) / 1000;
const duration = audio.samples.length / audio.sampleRate;
const real_time_factor = elapsed_seconds / duration;
console.log('Wave duration', duration.toFixed(3), 'seconds');
console.log('Elapsed', elapsed_seconds.toFixed(3), 'seconds');
console.log(
    `RTF = ${elapsed_seconds.toFixed(3)}/${duration.toFixed(3)} =`,
    real_time_factor.toFixed(3));

const filename = 'test-pocket-bria.wav';
sherpa_onnx.writeWave(
    filename, {samples: audio.samples, sampleRate: audio.sampleRate});

console.log(`Saved to ${filename}`);
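The example reports a real-time factor (RTF): the time spent generating divided by the duration of the audio produced. A value below 1 means synthesis ran faster than playback. The arithmetic can be isolated as a small function; the helper name `realTimeFactor` and the numbers below are made up for illustration and are not part of sherpa-onnx-node.

```javascript
// Hypothetical helper for illustration; not part of sherpa-onnx-node.
// numSamples: length of the generated sample array
// sampleRate: samples per second of the generated audio
// elapsedMs:  wall-clock generation time in milliseconds
function realTimeFactor(numSamples, sampleRate, elapsedMs) {
  const audioSeconds = numSamples / sampleRate;
  const elapsedSeconds = elapsedMs / 1000;
  return elapsedSeconds / audioSeconds;
}

// Made-up numbers: 48000 samples at 24000 Hz is 2 s of audio.
// Generating it in 500 ms gives 0.5 / 2.
const rtf = realTimeFactor(48000, 24000, 500);
console.log(rtf.toFixed(3)); // 0.250
```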
How to run
Install the package:
npm install sherpa-onnx-node
Download the model:
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
tar xf sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
rm sherpa-onnx-pocket-tts-int8-2026-01-26.tar.bz2
Set the library path and run:
# macOS
export DYLD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$DYLD_LIBRARY_PATH
# Linux
export LD_LIBRARY_PATH=$(npm root)/sherpa-onnx-node/lib:$LD_LIBRARY_PATH
node tts_pocket_sync.js
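Before running the script, it may help to confirm that the archive unpacked into the directory the example expects. The following is a minimal sketch using only Node's built-in `fs` and `path` modules; the helper `findMissingModelFiles` is hypothetical (not part of sherpa-onnx-node), and the file list is copied from the config in the example.

```javascript
const fs = require('fs');
const path = require('path');

// File names taken from the example's config and reference audio path.
const POCKET_MODEL_FILES = [
  'lm_flow.int8.onnx',
  'lm_main.int8.onnx',
  'encoder.onnx',
  'decoder.int8.onnx',
  'text_conditioner.onnx',
  'vocab.json',
  'token_scores.json',
  'test_wavs/bria.wav',
];

// Hypothetical helper: returns the expected paths that do not exist on disk.
function findMissingModelFiles(modelDir, files = POCKET_MODEL_FILES) {
  return files
      .map((name) => path.join(modelDir, name))
      .filter((fullPath) => !fs.existsSync(fullPath));
}

const missing =
    findMissingModelFiles('./sherpa-onnx-pocket-tts-int8-2026-01-26');
if (missing.length > 0) {
  console.error('Missing model files:\n' + missing.join('\n'));
} else {
  console.log('All model files found.');
}
```

An empty result means every path the example references is present; anything listed usually points to an incomplete download or to running the script from a different working directory.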
Notes
- Pocket TTS uses voice cloning via referenceAudio in the GenerationConfig. Provide a WAV file of the target speaker.
- The config key is pocket with fields: lmFlow, lmMain, encoder, decoder, textConditioner, vocabJson, tokenScoresJson, voiceEmbeddingCacheCapacity.
- GenerationConfig fields for Pocket:
  - referenceAudio: Float32Array of the reference audio samples.
  - referenceSampleRate: Sample rate of the reference audio.
  - numSteps: Number of diffusion steps (e.g., 5).
  - extra.max_reference_audio_len: Max reference audio length in seconds.
  - extra.seed: Random seed for reproducibility.
- Pocket also supports async generation with createAsync() and generateAsync(). See the async example and play async example.
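The notes describe extra.max_reference_audio_len as a cap, in seconds, on the reference clip. How the library applies that cap internally is not stated here, so the sketch below only shows what capping a clip yourself could look like before passing it as referenceAudio. The helper `trimReference` is hypothetical, and keeping the start of the clip is an assumption.

```javascript
// Hypothetical helper, not part of sherpa-onnx-node.
// Keeps at most maxSeconds of audio from the start of the clip (an assumption;
// the library's own handling of long clips may differ).
function trimReference(samples, sampleRate, maxSeconds) {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  // subarray() returns a view on the same buffer; use slice() for a copy.
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) :
                                       samples;
}

// Made-up input: 15 s of silence at 24000 Hz, capped to 12 s.
const fakeReference = new Float32Array(24000 * 15);
const trimmed = trimReference(fakeReference, 24000, 12);
console.log(trimmed.length / 24000); // 12
```

With the example's own data, the call would take referenceWave.samples and referenceWave.sampleRate as its first two arguments.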