Kokoro

This page lists pre-trained models from https://huggingface.co/hexgrad/Kokoro-82M.

kokoro-en-v0_19 (English, 11 speakers)

This model contains 11 speakers. The ONNX model is from https://github.com/thewh1teagle/kokoro-onnx/releases/tag/model-files

The script for adding metadata to the ONNX model can be found at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/kokoro

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-en-v0_19.tar.bz2
tar xf kokoro-en-v0_19.tar.bz2
rm kokoro-en-v0_19.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

ls -lh kokoro-en-v0_19/

total 686208
-rw-r--r--    1 fangjun  staff    11K Jan 15 16:23 LICENSE
-rw-r--r--    1 fangjun  staff   235B Jan 15 16:25 README.md
drwxr-xr-x  122 fangjun  staff   3.8K Nov 28  2023 espeak-ng-data
-rw-r--r--    1 fangjun  staff   330M Jan 15 16:25 model.onnx
-rw-r--r--    1 fangjun  staff   1.1K Jan 15 16:25 tokens.txt
-rw-r--r--    1 fangjun  staff   5.5M Jan 15 16:25 voices.bin
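
The check above can also be scripted. The following is a minimal sketch that reports which of the entries shown in the listing are missing after extraction. The helper names `missing_model_entries` and `file_sizes` are hypothetical and not part of sherpa-onnx.

```python
from pathlib import Path

# Names taken from the directory listing above.
EXPECTED_FILES = ["model.onnx", "voices.bin", "tokens.txt"]
EXPECTED_DIRS = ["espeak-ng-data"]


def missing_model_entries(model_dir):
    """Return the expected files and directories absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


def file_sizes(model_dir):
    """Return {file name: size in bytes} for the expected files that exist."""
    root = Path(model_dir)
    return {
        name: (root / name).stat().st_size
        for name in EXPECTED_FILES
        if (root / name).is_file()
    }
```

Calling `missing_model_entries("./kokoro-en-v0_19")` should return an empty list for a complete download, and `file_sizes` can be compared by eye against the listing (roughly 330M for model.onnx and 5.5M for voices.bin).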

Map between speaker ID and speaker name

The model contains 11 speakers, and we use integer IDs 0-10 to represent them.

The map is given below:

Speaker ID    Speaker Name
0             af
1             af_bella
2             af_nicole
3             af_sarah
4             af_sky
5             am_adam
6             am_michael
7             bf_emma
8             bf_isabella
9             bm_george
10            bm_lewis
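
For scripting, the map above can be kept as a small lookup table. This is an illustrative sketch only: the names `SPEAKER_NAMES`, `speaker_name`, and `speaker_id` are hypothetical helpers, not part of the sherpa-onnx API. The list order mirrors the table, so the list index is the speaker ID.

```python
# Speaker names in ID order, copied from the table above.
SPEAKER_NAMES = [
    "af", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael", "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis",
]


def speaker_name(sid):
    """Map an integer speaker ID (0-10) to its name."""
    if not 0 <= sid < len(SPEAKER_NAMES):
        raise ValueError(f"sid must be in 0..{len(SPEAKER_NAMES) - 1}, got {sid}")
    return SPEAKER_NAMES[sid]


def speaker_id(name):
    """Map a speaker name to its integer ID; raises ValueError if unknown."""
    return SPEAKER_NAMES.index(name)
```

The value returned by `speaker_id` is what would be passed to `--sid` in the commands in the following sections.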

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --kokoro-model=./kokoro-en-v0_19/model.onnx \
  --kokoro-voices=./kokoro-en-v0_19/voices.bin \
  --kokoro-tokens=./kokoro-en-v0_19/tokens.txt \
  --kokoro-data-dir=./kokoro-en-v0_19/espeak-ng-data \
  --num-threads=2 \
  --sid=10 \
  --output-filename="./10-bm_lewis.wav" \
  "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be, a statesman, a businessman, an official, or a scholar."

After running, it will generate a file 10-bm_lewis.wav in the current directory.

soxi ./10-bm_lewis.wav

Input File     : './10-bm_lewis.wav'
Channels       : 1
Sample Rate    : 24000
Precision      : 16-bit
Duration       : 00:00:15.80 = 379200 samples ~ 1185 CDDA sectors
File Size      : 758k
Bit Rate       : 384k
Sample Encoding: 16-bit Signed Integer PCM

Hint

The sample rate of this model is fixed at 24000 Hz.

Wave filename      Text
10-bm_lewis.wav    "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be, a statesman, a businessman, an official, or a scholar."
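
To repeat the command above for other speakers, the argument list can be assembled programmatically. The sketch below only builds the argument list. It uses the same flags as the command above, and assumes the executable and model directory are at the paths used on this page. The function name `build_tts_command` and the `<sid>-<speaker>.wav` naming helper are hypothetical.

```python
def build_tts_command(
    sid,
    speaker,
    text,
    model_dir="./kokoro-en-v0_19",
    num_threads=2,
    exe="./build/bin/sherpa-onnx-offline-tts",
):
    """Build the argument list for one sherpa-onnx-offline-tts run.

    The output file is named <sid>-<speaker>.wav, following the
    naming used in the example above.
    """
    return [
        exe,
        f"--kokoro-model={model_dir}/model.onnx",
        f"--kokoro-voices={model_dir}/voices.bin",
        f"--kokoro-tokens={model_dir}/tokens.txt",
        f"--kokoro-data-dir={model_dir}/espeak-ng-data",
        f"--num-threads={num_threads}",
        f"--sid={sid}",
        f"--output-filename=./{sid}-{speaker}.wav",
        text,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from /path/to/sherpa-onnx. Passing a list rather than a shell string avoids quoting problems when the text contains quotes or colons.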

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
  --kokoro-model=./kokoro-en-v0_19/model.onnx \
  --kokoro-voices=./kokoro-en-v0_19/voices.bin \
  --kokoro-tokens=./kokoro-en-v0_19/tokens.txt \
  --kokoro-data-dir=./kokoro-en-v0_19/espeak-ng-data \
  --num-threads=2 \
  --sid=2 \
  --output-filename=./2-af_nicole.wav \
  "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."

After running, it will generate a file 2-af_nicole.wav in the current directory.

soxi ./2-af_nicole.wav

Input File     : './2-af_nicole.wav'
Channels       : 1
Sample Rate    : 24000
Precision      : 16-bit
Duration       : 00:00:11.45 = 274800 samples ~ 858.75 CDDA sectors
File Size      : 550k
Bit Rate       : 384k
Sample Encoding: 16-bit Signed Integer PCM

Wave filename      Text
2-af_nicole.wav    "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
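
If soxi is not installed, the same header fields can be read with Python's standard `wave` module. This is a minimal sketch, assuming the generated file is a plain PCM WAV file as reported above. The function name `wav_info` is hypothetical.

```python
import wave


def wav_info(path):
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as f:
        num_samples = f.getnframes()
        sample_rate = f.getframerate()
        return {
            "channels": f.getnchannels(),
            "sample_rate": sample_rate,
            "bits_per_sample": 8 * f.getsampwidth(),
            "num_samples": num_samples,
            # Duration is the number of samples divided by the sample rate.
            "duration_seconds": num_samples / sample_rate,
        }
```

For the file above, `wav_info("./2-af_nicole.wav")` should report the values shown by soxi: 274800 samples at 24000 Hz gives 274800 / 24000 = 11.45 seconds.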

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --kokoro-model=./kokoro-en-v0_19/model.onnx \
   --kokoro-voices=./kokoro-en-v0_19/voices.bin \
   --kokoro-tokens=./kokoro-en-v0_19/tokens.txt \
   --kokoro-data-dir=./kokoro-en-v0_19/espeak-ng-data \
   --sid=2 \
   --output-filename=./2-af_nicole.wav \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

num_threads    1        2        3        4
RTF            6.629    3.870    2.999    2.774
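
RTF (real-time factor) is the time spent synthesizing divided by the duration of the generated audio, so values above 1 mean synthesis is slower than real time. The sketch below turns the numbers above into estimated wall-clock times and speedups relative to one thread. It assumes the benchmark produced the same 11.45-second audio that soxi reported for this text and speaker earlier on this page, so the times are estimates, not measurements.

```python
# RTF values from the results above, keyed by number of threads.
RTF_BY_THREADS = {1: 6.629, 2: 3.870, 3: 2.999, 4: 2.774}

# Assumption: same audio duration as the 2-af_nicole.wav example above.
AUDIO_SECONDS = 11.45


def estimated_seconds(rtf, audio_seconds):
    """Estimated synthesis time: RTF = synthesis time / audio duration."""
    return rtf * audio_seconds


def speedup(rtf_baseline, rtf):
    """How many times faster a run is compared with the baseline run."""
    return rtf_baseline / rtf


for threads, rtf in RTF_BY_THREADS.items():
    seconds = estimated_seconds(rtf, AUDIO_SECONDS)
    gain = speedup(RTF_BY_THREADS[1], rtf)
    print(f"{threads} thread(s): about {seconds:.1f} s, {gain:.2f}x vs 1 thread")
```

Since every RTF in the table is above 1, this model does not reach real-time synthesis on this board with any of the tested thread counts, and the gain from the fourth thread is small compared with the gain from the second.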