vits

This page lists pre-trained vits models.

All models in a single table

The following table summarizes all models on this page.

Note

Since there are more than 100 pre-trained models for over 40 languages, we don’t list all of them on this page. Please find them at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.

You can try all the models in the following Hugging Face Space: https://huggingface.co/spaces/k2-fsa/text-to-speech

Hint

You can find Android APKs for each model on the following page:

https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

| Model | Language | # Speakers | Dataset | Model filesize (MB) | Sample rate (Hz) |
|---|---|---|---|---|---|
| vits-melo-tts-zh_en (Chinese + English, 1 speaker) | Chinese + English | 1 | N/A | 163 | 44100 |
| vits-piper-en_US-libritts_r-medium (English, 904 speakers) | English | 904 | LibriTTS-R | 75 | 22050 |
| vits-piper-en_US-glados (English, 1 speaker) | English | 1 | N/A | 61 | 22050 |
| csukuangfj/sherpa-onnx-vits-zh-ll (Chinese, 5 speakers) | Chinese | 5 | N/A | 115 | 16000 |
| csukuangfj/vits-zh-hf-fanchen-C (Chinese, 187 speakers) | Chinese | 187 | N/A | 116 | 16000 |
| csukuangfj/vits-zh-hf-fanchen-wnj (Chinese, 1 male) | Chinese | 1 | N/A | 116 | 16000 |
| csukuangfj/vits-zh-hf-theresa (Chinese, 804 speakers) | Chinese | 804 | N/A | 117 | 22050 |
| csukuangfj/vits-zh-hf-eula (Chinese, 804 speakers) | Chinese | 804 | N/A | 117 | 22050 |
| aishell3 (Chinese, multi-speaker, 174 speakers) | Chinese | 174 | aishell3 | 116 | 8000 |
| ljspeech (English, single-speaker) | English (US) | 1 (Female) | LJ Speech | 109 | 22050 |
| VCTK (English, multi-speaker, 109 speakers) | English | 109 | VCTK | 116 | 22050 |
| en_US-lessac-medium (English, single-speaker) | English (US) | 1 (Male) | lessac_blizzard2013 | 61 | 22050 |

vits-melo-tts-zh_en (Chinese + English, 1 speaker)

This model is converted from https://huggingface.co/myshell-ai/MeloTTS-Chinese and it supports only 1 speaker. It supports both Chinese and English.

Note that for English input, only words present in lexicon.txt can be pronounced. Please refer to https://github.com/k2-fsa/sherpa-onnx/pull/1209 for how to add new words.
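To see ahead of time which English words would not be pronounced, one can compare the input text against lexicon.txt. The following is only a sketch: it assumes that each lexicon line starts with the word followed by whitespace and its tokens, and that lookup is case-insensitive. `load_lexicon_words` and `missing_english_words` are hypothetical helper names, not part of sherpa-onnx.

```python
import re


def load_lexicon_words(lines):
    """Collect the first whitespace-separated field of every non-empty line.

    Assumption: each lexicon line looks like "<word> <token> <token> ...".
    """
    words = set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0].lower())
    return words


def missing_english_words(text, lexicon_words):
    """Return the English words in `text` that are absent from the lexicon.

    Words are lower-cased and reported once, in order of first appearance.
    """
    missing = []
    for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        key = word.lower()
        if key not in lexicon_words and key not in missing:
            missing.append(key)
    return missing


if __name__ == "__main__":
    # A tiny in-memory stand-in for lexicon.txt (hypothetical entries).
    demo_lexicon = ["machine m ah sh iy n", "learning l er n ih ng"]
    words = load_lexicon_words(demo_lexicon)
    print(missing_english_words("我最近在学习machine learning和sherpa", words))
```

With the real model, the lexicon could be loaded with `load_lexicon_words(open("./vits-melo-tts-zh_en/lexicon.txt", encoding="utf-8"))`, assuming the file layout described above.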

Hint

The converting script is available at https://github.com/k2-fsa/sherpa-onnx/tree/master/scripts/melo-tts

You can convert more models from https://github.com/myshell-ai/MeloTTS yourself.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2
tar xvf vits-melo-tts-zh_en.tar.bz2
rm vits-melo-tts-zh_en.tar.bz2
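The same three steps (download, unpack, delete the archive) can also be scripted. The sketch below uses only the Python standard library. `archive_url` and `download_and_extract` are hypothetical helper names, and the code assumes that an archive in the tts-models release follows the URL pattern of the command above and unpacks into a directory named after the model.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the wget command above.
RELEASE_URL = "https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models"


def archive_url(model_name, base_url=RELEASE_URL):
    """URL of the .tar.bz2 archive for a model (assumed naming pattern)."""
    return f"{base_url}/{model_name}.tar.bz2"


def download_and_extract(model_name, dest_dir=".", base_url=RELEASE_URL):
    """Download the archive, unpack it into dest_dir, then delete the archive.

    Returns the path of the directory the archive is assumed to unpack into.
    """
    dest = Path(dest_dir)
    archive = dest / f"{model_name}.tar.bz2"
    urllib.request.urlretrieve(archive_url(model_name, base_url), archive)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest)
    archive.unlink()
    return dest / model_name


if __name__ == "__main__":
    # Only prints the URL; call download_and_extract() to fetch the model.
    print(archive_url("vits-melo-tts-zh_en"))
```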

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

ls -lh vits-melo-tts-zh_en/
total 346848
-rw-r--r--  1 fangjun  staff   1.0K Jul 16 13:38 LICENSE
-rw-r--r--  1 fangjun  staff   156B Jul 16 13:38 README.md
-rw-r--r--  1 fangjun  staff    58K Jul 16 13:38 date.fst
drwxr-xr-x  9 fangjun  staff   288B Apr 19 20:42 dict
-rw-r--r--  1 fangjun  staff   6.5M Jul 16 13:38 lexicon.txt
-rw-r--r--  1 fangjun  staff   163M Jul 16 13:38 model.onnx
-rw-r--r--  1 fangjun  staff    63K Jul 16 13:38 number.fst
-rw-r--r--  1 fangjun  staff    87K Jul 16 13:38 phone.fst
-rw-r--r--  1 fangjun  staff   655B Jul 16 13:38 tokens.txt
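The size check can also be scripted. This is a minimal sketch, assuming the model was unpacked into ./vits-melo-tts-zh_en; `onnx_sizes_mib` is a hypothetical helper name.

```python
from pathlib import Path


def onnx_sizes_mib(model_dir):
    """Map each *.onnx file directly inside model_dir to its size in MiB."""
    return {
        path.name: round(path.stat().st_size / (1024 * 1024), 1)
        for path in sorted(Path(model_dir).glob("*.onnx"))
    }


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or has no *.onnx files.
    for name, size in onnx_sizes_mib("./vits-melo-tts-zh_en").items():
        print(f"{name}: {size} MiB")
```

For this model, the reported size of model.onnx should be close to the 163M shown in the listing above.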

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
 --vits-model=./vits-melo-tts-zh_en/model.onnx \
 --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
 --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
 --vits-dict-dir=./vits-melo-tts-zh_en/dict \
 --output-filename=./zh-en-0.wav \
 "This is a 中英文的 text to speech 测试例子。"

./build/bin/sherpa-onnx-offline-tts \
 --vits-model=./vits-melo-tts-zh_en/model.onnx \
 --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
 --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
 --vits-dict-dir=./vits-melo-tts-zh_en/dict \
 --output-filename=./zh-en-1.wav \
 "我最近在学习machine learning,希望能够在未来的artificial intelligence领域有所建树。"

./build/bin/sherpa-onnx-offline-tts-play \
 --vits-model=./vits-melo-tts-zh_en/model.onnx \
 --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
 --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
 --tts-rule-fsts="./vits-melo-tts-zh_en/date.fst,./vits-melo-tts-zh_en/number.fst" \
 --vits-dict-dir=./vits-melo-tts-zh_en/dict \
 --output-filename=./zh-en-2.wav \
 "Are you ok 是雷军2015年4月小米在印度举行新品发布会时说的。他还说过 I am very happy to be in China.雷军事后在微博上表示「万万没想到,视频火速传到国内,全国人民都笑了」、「现在国际米粉越来越多,我的确应该把英文学好,不让大家失望!加油!」"

After running, it will generate three files zh-en-0.wav, zh-en-1.wav, and zh-en-2.wav in the current directory.

soxi zh-en-*.wav

Input File     : 'zh-en-0.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:00:03.54 = 156160 samples = 265.578 CDDA sectors
File Size      : 312k
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'zh-en-1.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:00:05.98 = 263680 samples = 448.435 CDDA sectors
File Size      : 527k
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'zh-en-2.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:00:18.92 = 834560 samples = 1419.32 CDDA sectors
File Size      : 1.67M
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 3 files: 00:00:28.44

| Wave filename | Text |
|---|---|
| zh-en-0.wav | This is a 中英文的 text to speech 测试例子。 |
| zh-en-1.wav | 我最近在学习machine learning,希望能够在未来的artificial intelligence领域有所建树。 |
| zh-en-2.wav | Are you ok 是雷军2015年4月小米在印度举行新品发布会时说的。他还说过 I am very happy to be in China.雷军事后在微博上表示「万万没想到,视频火速传到国内,全国人民都笑了」、「现在国际米粉越来越多,我的确应该把英文学好,不让大家失望!加油!」 |
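The numbers reported by soxi are related by simple arithmetic: the duration is the sample count divided by the sample rate, and for 16-bit mono PCM the bit rate is the sample rate times 16. The sketch below checks this for zh-en-0.wav. `pcm_stats` is a hypothetical helper name, and the payload size it computes ignores the WAV header.

```python
def pcm_stats(num_samples, sample_rate, bits_per_sample=16, channels=1):
    """Return (duration in seconds, bit rate in bit/s, payload size in bytes)."""
    duration = num_samples / sample_rate
    bit_rate = sample_rate * bits_per_sample * channels
    payload_bytes = num_samples * channels * bits_per_sample // 8
    return duration, bit_rate, payload_bytes


if __name__ == "__main__":
    # Sample count and sample rate of zh-en-0.wav, taken from the soxi output.
    duration, bit_rate, payload = pcm_stats(156160, 44100)
    print(f"{duration:.2f} s, {bit_rate} bit/s, {payload} bytes")
```

The results (about 3.54 s, 705600 bit/s, 312320 bytes) agree with the Duration, Bit Rate (706k), and File Size (312k) lines that soxi prints for this file.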

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts-play.py \
 --vits-model=./vits-melo-tts-zh_en/model.onnx \
 --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
 --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
 --vits-dict-dir=./vits-melo-tts-zh_en/dict \
 --output-filename=./zh-en-3.wav \
 "它也支持繁体字. 我相信你們一定聽過愛迪生說過的這句話Genius is one percent inspiration and ninety-nine percent perspiration. "

After running, it will generate a file zh-en-3.wav in the current directory.

soxi zh-en-3.wav

Input File     : 'zh-en-3.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:00:09.83 = 433664 samples = 737.524 CDDA sectors
File Size      : 867k
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM

| Wave filename | Text |
|---|---|
| zh-en-3.wav | 它也支持繁体字. 我相信你們一定聽過愛迪生說過的這句話Genius is one percent inspiration and ninety-nine percent perspiration. |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-melo-tts-zh_en/model.onnx \
   --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
   --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
   --vits-dict-dir=./vits-melo-tts-zh_en/dict \
   "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与温柔."
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 6.727 | 3.877 | 2.914 | 2.518 |
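To help read these numbers: assuming the usual definition of RTF (real-time factor) as synthesis time divided by the duration of the generated audio, a value below 1 means that speech is produced faster than it takes to play. The sketch below computes the speedup gained from additional threads using the values in the table above; the helper names are hypothetical.

```python
def speedups(rtf_by_threads):
    """Return {num_threads: speedup relative to the single-thread RTF}."""
    base = rtf_by_threads[1]
    return {n: round(base / rtf, 2) for n, rtf in rtf_by_threads.items()}


def is_faster_than_real_time(rtf):
    """Under the assumed definition, RTF < 1 means faster than real time."""
    return rtf < 1.0


if __name__ == "__main__":
    # RTF values for vits-melo-tts-zh_en on Raspberry Pi 4 Model B Rev 1.5.
    melo = {1: 6.727, 2: 3.877, 3: 2.914, 4: 2.518}
    print(speedups(melo))
    print(any(is_faster_than_real_time(rtf) for rtf in melo.values()))
```

For this model, four threads give roughly a 2.7x speedup over one thread, yet the RTF stays above 1 on this board.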

vits-piper-en_US-glados (English, 1 speaker)

This model is converted from https://github.com/dnhkng/Glados/raw/main/models/glados.onnx and it supports only English.

See also https://github.com/dnhkng/GlaDOS .

If you are interested in how the model is converted to sherpa-onnx, please see the following colab notebook:

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-glados.tar.bz2
tar xvf vits-piper-en_US-glados.tar.bz2
rm vits-piper-en_US-glados.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

ls -lh vits-piper-en_US-glados/

-rw-r--r--    1 fangjun  staff   242B Dec 13  2023 README.md
-rw-r--r--    1 fangjun  staff    61M Dec 13  2023 en_US-glados.onnx
drwxr-xr-x  122 fangjun  staff   3.8K Dec 13  2023 espeak-ng-data
-rw-r--r--    1 fangjun  staff   940B Dec 13  2023 tokens.txt

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
  --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
  --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
  --output-filename=./glados-liliana.wav \
  "liliana, the most beautiful and lovely assistant of our team!"

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
  --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
  --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
  --output-filename=./glados-code.wav \
  "Talk is cheap. Show me the code."

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
  --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
  --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
  --output-filename=./glados-men.wav \
  "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar."

After running, it will generate 3 files glados-liliana.wav, glados-code.wav, and glados-men.wav in the current directory.

soxi glados*.wav

Input File     : 'glados-code.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:02.18 = 48128 samples ~ 163.701 CDDA sectors
File Size      : 96.3k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'glados-liliana.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.97 = 87552 samples ~ 297.796 CDDA sectors
File Size      : 175k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'glados-men.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:15.31 = 337664 samples ~ 1148.52 CDDA sectors
File Size      : 675k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 3 files: 00:00:21.47

| Wave filename | Text |
|---|---|
| glados-liliana.wav | liliana, the most beautiful and lovely assistant of our team! |
| glados-code.wav | Talk is cheap. Show me the code. |
| glados-men.wav | Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar. |

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
 --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
 --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
 --output-filename=./glados-ship.wav \
 "A ship in port is safe, but that's not what ships are built for."

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
 --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
 --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
 --output-filename=./glados-bug.wav \
 "Given enough eyeballs, all bugs are shallow."

After running, it will generate two files glados-ship.wav and glados-bug.wav in the current directory.

soxi ./glados-{ship,bug}.wav

Input File     : './glados-ship.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.74 = 82432 samples ~ 280.381 CDDA sectors
File Size      : 165k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : './glados-bug.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:02.67 = 58880 samples ~ 200.272 CDDA sectors
File Size      : 118k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 2 files: 00:00:06.41

| Wave filename | Text |
|---|---|
| glados-ship.wav | A ship in port is safe, but that's not what ships are built for. |
| glados-bug.wav | Given enough eyeballs, all bugs are shallow. |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx \
   --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
   --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 0.812 | 0.480 | 0.391 | 0.349 |

vits-piper-en_US-libritts_r-medium (English, 904 speakers)

This model is converted from https://huggingface.co/rhasspy/piper-voices/tree/main/en/en_US/libritts_r/medium and it supports 904 speakers. It supports only English.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-libritts_r-medium.tar.bz2
tar xvf vits-piper-en_US-libritts_r-medium.tar.bz2
rm vits-piper-en_US-libritts_r-medium.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

ls -lh vits-piper-en_US-libritts_r-medium/
total 153552
-rw-r--r--    1 fangjun  staff   279B Nov 29  2023 MODEL_CARD
-rw-r--r--    1 fangjun  staff    75M Nov 29  2023 en_US-libritts_r-medium.onnx
-rw-r--r--    1 fangjun  staff    20K Nov 29  2023 en_US-libritts_r-medium.onnx.json
drwxr-xr-x  122 fangjun  staff   3.8K Nov 28  2023 espeak-ng-data
-rw-r--r--    1 fangjun  staff   954B Nov 29  2023 tokens.txt
-rwxr-xr-x    1 fangjun  staff   1.8K Nov 29  2023 vits-piper-en_US.py
-rwxr-xr-x    1 fangjun  staff   730B Nov 29  2023 vits-piper-en_US.sh

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
  --vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
  --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
  --output-filename=./libritts-liliana-109.wav \
  --sid=109 \
  "liliana, the most beautiful and lovely assistant of our team!"

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
  --vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
  --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
  --output-filename=./libritts-liliana-900.wav \
  --sid=900 \
  "liliana, the most beautiful and lovely assistant of our team!"

After running, it will generate two files libritts-liliana-109.wav and libritts-liliana-900.wav in the current directory.

soxi libritts-liliana-*.wav

Input File     : 'libritts-liliana-109.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:02.73 = 60160 samples ~ 204.626 CDDA sectors
File Size      : 120k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : 'libritts-liliana-900.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.36 = 73984 samples ~ 251.646 CDDA sectors
File Size      : 148k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 2 files: 00:00:06.08

| Wave filename | Text |
|---|---|
| libritts-liliana-109.wav | liliana, the most beautiful and lovely assistant of our team! |
| libritts-liliana-900.wav | liliana, the most beautiful and lovely assistant of our team! |

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
 --vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
 --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
 --sid=200 \
 --output-filename=./libritts-armstrong-200.wav \
 "That's one small step for a man, a giant leap for mankind."

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
 --vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
 --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
 --sid=500 \
 --output-filename=./libritts-armstrong-500.wav \
 "That's one small step for a man, a giant leap for mankind."

After running, it will generate two files libritts-armstrong-200.wav and libritts-armstrong-500.wav in the current directory.

soxi ./libritts-armstrong*.wav

Input File     : './libritts-armstrong-200.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.11 = 68608 samples ~ 233.361 CDDA sectors
File Size      : 137k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM


Input File     : './libritts-armstrong-500.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.42 = 75520 samples ~ 256.871 CDDA sectors
File Size      : 151k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Total Duration of 2 files: 00:00:06.54

| Wave filename | Text |
|---|---|
| libritts-armstrong-200.wav | That's one small step for a man, a giant leap for mankind. |
| libritts-armstrong-500.wav | That's one small step for a man, a giant leap for mankind. |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
   --vits-tokens=./vits-piper-en_US-libritts_r-medium/tokens.txt \
   --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 0.790 | 0.493 | 0.392 | 0.357 |

ljspeech (English, single-speaker)

This model is converted from pretrained_ljspeech.pth, which was trained by the VITS author Jaehyeon Kim on the LJ Speech dataset. It supports only English and is a single-speaker model.

Note

If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-ljs.py

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-ljs.tar.bz2
tar xvf vits-ljs.tar.bz2
rm vits-ljs.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

-rw-r--r-- 1 1001 127 109M Apr 22 02:38 vits-ljs/vits-ljs.onnx

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-ljs/vits-ljs.onnx \
  --vits-lexicon=./vits-ljs/lexicon.txt \
  --vits-tokens=./vits-ljs/tokens.txt \
  --output-filename=./liliana.wav \
  "liliana, the most beautiful and lovely assistant of our team!"

After running, it will generate a file liliana.wav in the current directory.

soxi ./liliana.wav

Input File     : './liliana.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:04.39 = 96768 samples ~ 329.143 CDDA sectors
File Size      : 194k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

| Wave filename | Text |
|---|---|
| liliana.wav | liliana, the most beautiful and lovely assistant of our team! |

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-ljs/vits-ljs.onnx \
 --vits-lexicon=./vits-ljs/lexicon.txt \
 --vits-tokens=./vits-ljs/tokens.txt \
 --output-filename=./armstrong.wav \
 "That's one small step for a man, a giant leap for mankind."

After running, it will generate a file armstrong.wav in the current directory.

soxi ./armstrong.wav

Input File     : './armstrong.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:04.81 = 105984 samples ~ 360.49 CDDA sectors
File Size      : 212k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

| Wave filename | Text |
|---|---|
| armstrong.wav | That's one small step for a man, a giant leap for mankind. |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-ljs/vits-ljs.onnx \
   --vits-lexicon=./vits-ljs/lexicon.txt \
   --vits-tokens=./vits-ljs/tokens.txt \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 6.057 | 3.517 | 2.535 | 2.206 |

VCTK (English, multi-speaker, 109 speakers)

This model is converted from pretrained_vctk.pth, which was trained by the VITS author Jaehyeon Kim on the VCTK dataset. It is an English-only, multi-speaker model with 109 speakers.

Note

If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-vctk.py

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-vctk.tar.bz2
tar xvf vits-vctk.tar.bz2
rm vits-vctk.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

vits-vctk fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff    37M Oct 16 10:57 vits-vctk.int8.onnx
-rw-r--r--  1 fangjun  staff   116M Oct 16 10:57 vits-vctk.onnx

Generate speech with executables compiled from C++

Since there are 109 speakers available, we can choose a speaker ID from 0 to 108. The default speaker ID is 0.

We use speaker IDs 0, 10, and 108 below to generate audio for the same text.

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-vctk/vits-vctk.onnx \
  --vits-lexicon=./vits-vctk/lexicon.txt \
  --vits-tokens=./vits-vctk/tokens.txt \
  --sid=0 \
  --output-filename=./kennedy-0.wav \
  "Ask not what your country can do for you; ask what you can do for your country."

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-vctk/vits-vctk.onnx \
  --vits-lexicon=./vits-vctk/lexicon.txt \
  --vits-tokens=./vits-vctk/tokens.txt \
  --sid=10 \
  --output-filename=./kennedy-10.wav \
  "Ask not what your country can do for you; ask what you can do for your country."

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-vctk/vits-vctk.onnx \
  --vits-lexicon=./vits-vctk/lexicon.txt \
  --vits-tokens=./vits-vctk/tokens.txt \
  --sid=108 \
  --output-filename=./kennedy-108.wav \
  "Ask not what your country can do for you; ask what you can do for your country."

It will generate 3 files: kennedy-0.wav, kennedy-10.wav, and kennedy-108.wav.

| Wave filename | Text |
|---|---|
| kennedy-0.wav | Ask not what your country can do for you; ask what you can do for your country. |
| kennedy-10.wav | Ask not what your country can do for you; ask what you can do for your country. |
| kennedy-108.wav | Ask not what your country can do for you; ask what you can do for your country. |
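When generating audio for many speakers, it can help to build the command line programmatically and reject invalid speaker IDs early. The sketch below assumes that the valid IDs for this 109-speaker model are the contiguous range 0 to 108 and uses only the flags shown in the commands above. `vctk_tts_command` is a hypothetical helper name; it only builds the argument list and does not run anything.

```python
def vctk_tts_command(sid, text, num_speakers=109):
    """Build the argv list for one sherpa-onnx-offline-tts call.

    Raises ValueError when sid is outside 0 .. num_speakers - 1.
    """
    if not 0 <= sid < num_speakers:
        raise ValueError(f"sid must be in [0, {num_speakers - 1}], got {sid}")
    return [
        "./build/bin/sherpa-onnx-offline-tts",
        "--vits-model=./vits-vctk/vits-vctk.onnx",
        "--vits-lexicon=./vits-vctk/lexicon.txt",
        "--vits-tokens=./vits-vctk/tokens.txt",
        f"--sid={sid}",
        f"--output-filename=./kennedy-{sid}.wav",
        text,
    ]


if __name__ == "__main__":
    text = "Ask not what your country can do for you; ask what you can do for your country."
    for sid in (0, 10, 108):
        # Print everything except the text argument to keep the output short.
        print(" ".join(vctk_tts_command(sid, text)[:-1]))
```

To actually run a command, the returned list can be passed to `subprocess.run(argv, check=True)` from the sherpa-onnx directory.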

Generate speech with Python script

We use speaker IDs 30, 66, and 99 below to generate audio for different transcripts.

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-vctk/vits-vctk.onnx \
 --vits-lexicon=./vits-vctk/lexicon.txt \
 --vits-tokens=./vits-vctk/tokens.txt \
 --sid=30 \
 --output-filename=./einstein-30.wav \
 "Life is like riding a bicycle. To keep your balance, you must keep moving."

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-vctk/vits-vctk.onnx \
 --vits-lexicon=./vits-vctk/lexicon.txt \
 --vits-tokens=./vits-vctk/tokens.txt \
 --sid=66 \
 --output-filename=./franklin-66.wav \
 "Three can keep a secret, if two of them are dead."

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-vctk/vits-vctk.onnx \
 --vits-lexicon=./vits-vctk/lexicon.txt \
 --vits-tokens=./vits-vctk/tokens.txt \
 --sid=99 \
 --output-filename=./martin-99.wav \
 "Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that"

It will generate 3 files: einstein-30.wav, franklin-66.wav, and martin-99.wav.

| Wave filename | Text |
|---|---|
| einstein-30.wav | Life is like riding a bicycle. To keep your balance, you must keep moving. |
| franklin-66.wav | Three can keep a secret, if two of them are dead. |
| martin-99.wav | Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-vctk/vits-vctk.onnx \
   --vits-lexicon=./vits-vctk/lexicon.txt \
   --vits-tokens=./vits-vctk/tokens.txt \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 6.079 | 3.483 | 2.537 | 2.226 |

csukuangfj/sherpa-onnx-vits-zh-ll (Chinese, 5 speakers)

You can download the model using the following commands:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-vits-zh-ll.tar.bz2
tar xvf sherpa-onnx-vits-zh-ll.tar.bz2
rm sherpa-onnx-vits-zh-ll.tar.bz2

Hint

This model is trained with the following framework

Please check the file sizes of the downloaded model:

ls -lh sherpa-onnx-vits-zh-ll/

-rw-r--r--  1 fangjun  staff   2.3K Apr 25 17:58 G_multisperaker_latest.json
-rw-r-----@ 1 fangjun  staff   2.2K Apr 25 17:22 G_multisperaker_latest_low.json
-rw-r--r--  1 fangjun  staff   127B Apr 25 17:58 README.md
-rw-r--r--  1 fangjun  staff    58K Apr 25 17:58 date.fst
drwxr-xr-x  9 fangjun  staff   288B Jun 21 16:32 dict
-rw-r--r--  1 fangjun  staff   368K Apr 25 17:58 lexicon.txt
-rw-r--r--  1 fangjun  staff   115M Apr 25 17:58 model.onnx
-rw-r--r--  1 fangjun  staff    21K Apr 25 17:58 new_heteronym.fst
-rw-r--r--  1 fangjun  staff    63K Apr 25 17:58 number.fst
-rw-r--r--  1 fangjun  staff    87K Apr 25 17:58 phone.fst
-rw-r--r--  1 fangjun  staff   331B Apr 25 17:58 tokens.txt

Usage:

sherpa-onnx-offline-tts \
  --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
  --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
  --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
  --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
  --vits-length-scale=0.5 \
  --sid=0 \
  --output-filename="./0-value-2x.wav" \
  "小米的核心价值观是什么?答案是真诚热爱!"


sherpa-onnx-offline-tts \
  --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
  --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
  --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
  --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
  --sid=1 \
  --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
  --output-filename="./1-numbers.wav" \
  "小米有14岁了"

sherpa-onnx-offline-tts \
  --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
  --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
  --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
  --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
  --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/phone.fst,./sherpa-onnx-vits-zh-ll/number.fst \
  --sid=2 \
  --output-filename="./2-numbers.wav" \
  "有困难,请拨打110 或者18601200909"

sherpa-onnx-offline-tts \
  --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
  --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
  --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
  --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
  --sid=3 \
  --output-filename="./3-wo-mi.wav" \
  "小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。"

sherpa-onnx-offline-tts \
  --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
  --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
  --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
  --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
  --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
  --sid=4 \
  --output-filename="./4-heteronym.wav" \
  "35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。"

| Wave filename | Text |
|---|---|
| 0-value-2x.wav | 小米的核心价值观是什么?答案是真诚热爱! |
| 1-numbers.wav | 小米有14岁了 |
| 2-numbers.wav | 有困难,请拨打110 或者18601200909 |
| 3-wo-mi.wav | 小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。 |
| 4-heteronym.wav | 35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。 |

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
   --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
   --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
   --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
   '当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与温柔.'
done

The results are given below:

| num_threads | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| RTF | 4.275 | 2.494 | 1.840 | 1.593 |

csukuangfj/vits-zh-hf-fanchen-C (Chinese, 187 speakers)

You can download the model using the following commands:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-C.tar.bz2
tar xvf vits-zh-hf-fanchen-C.tar.bz2
rm vits-zh-hf-fanchen-C.tar.bz2
# information about model files

total 291M
-rw-r--r-- 1 1001 127  58K Apr 21 05:40 date.fst
drwxr-xr-x 3 1001 127 4.0K Apr 19 12:42 dict
-rwxr-xr-x 1 1001 127 4.0K Apr 21 05:40 export-onnx-zh-hf-fanchen-models.py
-rwxr-xr-x 1 1001 127 2.5K Apr 21 05:40 generate-lexicon-zh-hf-fanchen-models.py
-rw-r--r-- 1 1001 127 2.4M Apr 21 05:40 lexicon.txt
-rw-r--r-- 1 1001 127  22K Apr 21 05:40 new_heteronym.fst
-rw-r--r-- 1 1001 127  63K Apr 21 05:40 number.fst
-rw-r--r-- 1 1001 127  87K Apr 21 05:40 phone.fst
-rw-r--r-- 1 1001 127 173M Apr 21 05:40 rule.far
-rw-r--r-- 1 1001 127  331 Apr 21 05:40 tokens.txt
-rw-r--r-- 1 1001 127 116M Apr 21 05:40 vits-zh-hf-fanchen-C.onnx
-rwxr-xr-x 1 1001 127 2.0K Apr 21 05:40 vits-zh-hf-fanchen-models.sh

Usage:

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
  --vits-length-scale=0.5 \
  --output-filename="./value-2x.wav" \
  "小米的核心价值观是什么?答案是真诚热爱!"


sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
  --vits-length-scale=1.0 \
  --tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
  --output-filename="./numbers.wav" \
  "小米有14岁了"

sherpa-onnx-offline-tts \
  --sid=100 \
  --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
  --vits-length-scale=1.0 \
  --tts-rule-fsts=./vits-zh-hf-fanchen-C/phone.fst,./vits-zh-hf-fanchen-C/number.fst \
  --output-filename="./numbers-100.wav" \
  "有困难,请拨打110 或者18601200909"

sherpa-onnx-offline-tts \
  --sid=14 \
  --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
  --vits-length-scale=1.0 \
  --output-filename="./wo-mi-14.wav" \
  "小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。"

sherpa-onnx-offline-tts \
  --sid=102 \
  --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
  --tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
  --vits-length-scale=1.0 \
  --output-filename="./heteronym-102.wav" \
  "35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。"

Wave filename        Text
value-2x.wav         小米的核心价值观是什么?答案是真诚热爱!
numbers.wav          小米有14岁了
numbers-100.wav      有困难,请拨打110 或者18601200909
wo-mi-14.wav         小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。
heteronym-102.wav    35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。
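
The first command above passes --vits-length-scale=0.5 and writes value-2x.wav. In VITS the length scale multiplies the predicted durations, so a smaller value gives faster speech and the flag behaves roughly like the inverse of a speed multiplier. The sketch below only illustrates that arithmetic; the function name is hypothetical and the exact duration ratio depends on the model and the text.

```python
def length_scale_for_speed(speed: float) -> float:
    """Approximate --vits-length-scale value for a desired speed multiplier."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed


# A 2x speed-up corresponds to the 0.5 used for value-2x.wav.
print(length_scale_for_speed(2.0))
```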

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
   --vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
   --vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
   --vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
   "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与 温柔."
done

The results are given below:

num_threads    1        2        3        4
RTF            4.306    2.451    1.846    1.600
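
RTF (real-time factor) is the time spent on synthesis divided by the duration of the generated audio, so a value above 1 means synthesis is slower than playback. The following sketch, with hypothetical helper names, shows the arithmetic and uses the 4-thread value reported above to estimate how long a clip would take on the same board.

```python
def rtf(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Real-time factor: synthesis time divided by audio duration."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def estimated_synthesis_seconds(rtf_value: float, audio_seconds: float) -> float:
    """Time needed to synthesize audio_seconds of speech at a given RTF."""
    return rtf_value * audio_seconds


# With 4 threads (RTF 1.600), 10 seconds of audio take about 16 seconds.
print(f"{estimated_synthesis_seconds(1.600, 10.0):.1f} s")
```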

csukuangfj/vits-zh-hf-fanchen-wnj (Chinese, 1 male)

You can download the model using the following commands:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-wnj.tar.bz2
tar xvf vits-zh-hf-fanchen-wnj.tar.bz2
rm vits-zh-hf-fanchen-wnj.tar.bz2
# information about model files
total 594760
-rw-r--r--  1 fangjun  staff    58K Apr 21 13:40 date.fst
drwxr-xr-x  9 fangjun  staff   288B Apr 19 20:42 dict
-rwxr-xr-x  1 fangjun  staff   3.9K Apr 21 13:40 export-onnx-zh-hf-fanchen-models.py
-rwxr-xr-x  1 fangjun  staff   2.4K Apr 21 13:40 generate-lexicon-zh-hf-fanchen-models.py
-rw-r--r--  1 fangjun  staff   2.3M Apr 21 13:40 lexicon.txt
-rw-r--r--  1 fangjun  staff    21K Apr 21 13:40 new_heteronym.fst
-rw-r--r--  1 fangjun  staff    63K Apr 21 13:40 number.fst
-rw-r--r--  1 fangjun  staff    87K Apr 21 13:40 phone.fst
-rw-r--r--  1 fangjun  staff   172M Apr 21 13:40 rule.far
-rw-r--r--  1 fangjun  staff   331B Apr 21 13:40 tokens.txt
-rwxr-xr-x  1 fangjun  staff   1.9K Apr 21 13:40 vits-zh-hf-fanchen-models.sh
-rw-r--r--  1 fangjun  staff   115M Apr 21 13:40 vits-zh-hf-fanchen-wnj.onnx

usage:

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
  --output-filename="./kuayue.wav" \
  "升级人车家全生态,小米迎跨越时刻。"

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
  --vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
  --vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
  --vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
  --tts-rule-fsts=./vits-zh-hf-fanchen-wnj/number.fst \
  --output-filename="./os.wav" \
  "这一全新操作系统,是小米14年来技术积淀的结晶。"

Wave filename    Text
kuayue.wav       升级人车家全生态,小米迎跨越时刻。
os.wav           这一全新操作系统,是小米14年来技术积淀的结晶。

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
   --vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
   --vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
   --vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
   "当夜幕 降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的 奇迹与温柔."
done

The results are given below:

num_threads    1        2        3        4
RTF            4.276    2.505    1.827    1.608

csukuangfj/vits-zh-hf-theresa (Chinese, 804 speakers)

You can download the model with the following commands:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-theresa.tar.bz2
tar xvf vits-zh-hf-theresa.tar.bz2
rm vits-zh-hf-theresa.tar.bz2
# information about model files

total 596992
-rw-r--r--  1 fangjun  staff    58K Apr 21 13:39 date.fst
drwxr-xr-x  9 fangjun  staff   288B Apr 19 20:42 dict
-rw-r--r--  1 fangjun  staff   2.6M Apr 21 13:39 lexicon.txt
-rw-r--r--  1 fangjun  staff    21K Apr 21 13:39 new_heteronym.fst
-rw-r--r--  1 fangjun  staff    63K Apr 21 13:39 number.fst
-rw-r--r--  1 fangjun  staff    87K Apr 21 13:39 phone.fst
-rw-r--r--  1 fangjun  staff   172M Apr 21 13:39 rule.far
-rw-r--r--  1 fangjun  staff   116M Apr 21 13:39 theresa.onnx
-rw-r--r--  1 fangjun  staff   268B Apr 21 13:39 tokens.txt
-rwxr-xr-x  1 fangjun  staff   5.3K Apr 21 13:39 vits-zh-hf-models.py
-rwxr-xr-x  1 fangjun  staff   571B Apr 21 13:39 vits-zh-hf-models.sh

usage:

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-theresa/theresa.onnx \
  --vits-dict-dir=./vits-zh-hf-theresa/dict \
  --vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
  --vits-tokens=./vits-zh-hf-theresa/tokens.txt \
  --sid=0 \
  --output-filename="./reai-0.wav" \
  "真诚就是不欺人也不自欺。热爱就是全心投入并享受其中。"

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-theresa/theresa.onnx \
  --vits-dict-dir=./vits-zh-hf-theresa/dict \
  --vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
  --vits-tokens=./vits-zh-hf-theresa/tokens.txt \
  --tts-rule-fsts=./vits-zh-hf-theresa/number.fst \
  --debug=1 \
  --sid=88 \
  --output-filename="./mi14-88.wav" \
  "小米14一周销量破1000000!"

Wave filename    Text
reai-0.wav       真诚就是不欺人也不自欺。热爱就是全心投入并享受其中。
mi14-88.wav      小米14一周销量破1000000!

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-zh-hf-theresa/theresa.onnx \
   --vits-dict-dir=./vits-zh-hf-theresa/dict \
   --vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
   --vits-tokens=./vits-zh-hf-theresa/tokens.txt \
   "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与 温柔."
done

The results are given below:

num_threads    1        2        3        4
RTF            6.032    3.448    2.566    2.210

csukuangfj/vits-zh-hf-eula (Chinese, 804 speakers)

You can download the model using the following commands:

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-eula.tar.bz2
tar xvf vits-zh-hf-eula.tar.bz2
rm vits-zh-hf-eula.tar.bz2
# information about model files

total 596992
-rw-r--r--  1 fangjun  staff    58K Apr 21 13:39 date.fst
drwxr-xr-x  9 fangjun  staff   288B Apr 19 20:42 dict
-rw-r--r--  1 fangjun  staff   116M Apr 21 13:39 eula.onnx
-rw-r--r--  1 fangjun  staff   2.6M Apr 21 13:39 lexicon.txt
-rw-r--r--  1 fangjun  staff    21K Apr 21 13:39 new_heteronym.fst
-rw-r--r--  1 fangjun  staff    63K Apr 21 13:39 number.fst
-rw-r--r--  1 fangjun  staff    87K Apr 21 13:39 phone.fst
-rw-r--r--  1 fangjun  staff   172M Apr 21 13:39 rule.far
-rw-r--r--  1 fangjun  staff   268B Apr 21 13:39 tokens.txt
-rwxr-xr-x  1 fangjun  staff   5.3K Apr 21 13:39 vits-zh-hf-models.py
-rwxr-xr-x  1 fangjun  staff   571B Apr 21 13:39 vits-zh-hf-models.sh

usage:

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-eula/eula.onnx \
  --vits-dict-dir=./vits-zh-hf-eula/dict \
  --vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
  --vits-tokens=./vits-zh-hf-eula/tokens.txt \
  --debug=1 \
  --sid=666 \
  --output-filename="./news-666.wav" \
  "小米在今天上午举办的核心干部大会上,公布了新十年的奋斗目标和科技战略,并发布了小米价值观的八条诠释。"

sherpa-onnx-offline-tts \
  --vits-model=./vits-zh-hf-eula/eula.onnx \
  --vits-dict-dir=./vits-zh-hf-eula/dict \
  --vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
  --vits-tokens=./vits-zh-hf-eula/tokens.txt \
  --tts-rule-fsts=./vits-zh-hf-eula/number.fst \
  --sid=99 \
  --output-filename="./news-99.wav" \
  "9月25日消息,雷军今日在微博发文称"

Wave filename    Text
news-666.wav     小米在今天上午举办的核心干部大会上,公布了新十年的奋斗目标和科技战略,并发布了小米价值观的八条诠释。
news-99.wav      9月25日消息,雷军今日在微博发文称

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-zh-hf-eula/eula.onnx \
   --vits-dict-dir=./vits-zh-hf-eula/dict \
   --vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
   --vits-tokens=./vits-zh-hf-eula/tokens.txt \
   "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与 温柔."
done

The results are given below:

num_threads    1        2        3        4
RTF            6.011    3.473    2.537    2.231

aishell3 (Chinese, multi-speaker, 174 speakers)

This model is trained on the aishell3 dataset using icefall.

It supports only Chinese and is a multi-speaker model with 174 speakers.

Hint

You can download the Android APK for this model at

https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html

(Please search for vits-icefall-zh-aishell3 in the above Android APK page)

Note

If you are interested in how the model is converted, please see the documentation of icefall.

If you are interested in training your own model, please also refer to icefall.

icefall is also developed by us.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-icefall-zh-aishell3.tar.bz2
tar xvf vits-icefall-zh-aishell3.tar.bz2
rm vits-icefall-zh-aishell3.tar.bz2

Please check that the file sizes of the pre-trained models are correct. See the file sizes of *.onnx files below.

vits-icefall-zh-aishell3 fangjun$ ls -lh *.onnx
-rw-r--r--  1 fangjun  staff    29M Mar 20 22:50 model.onnx
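
The same size check can be scripted. The sketch below lists every *.onnx file in a model directory with its size in MiB, using only the Python standard library; the function name is made up for this example, and the value to compare against is the 29M shown in the listing above.

```python
from pathlib import Path


def onnx_sizes_mib(model_dir):
    """Return {file name: size in MiB} for every *.onnx file in model_dir."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return {}
    return {
        path.name: path.stat().st_size / (1024 * 1024)
        for path in sorted(directory.glob("*.onnx"))
    }


# Directory name taken from the download step above.
for name, size in onnx_sizes_mib("./vits-icefall-zh-aishell3").items():
    print(f"{name}: {size:.0f} MiB")
```

A file that is much smaller than expected usually indicates an incomplete download.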

Generate speech with executables compiled from C++

Since there are 174 speakers available, we can choose a speaker ID from 0 to 173. The default speaker ID is 0.

We use speaker IDs 10, 33, and 99 below to generate audio for the same text.

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-icefall-zh-aishell3/model.onnx \
  --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
  --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
  --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
  --sid=10 \
  --output-filename=./liliana-10.wav \
  "林美丽最美丽、最漂亮、最可爱!"

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-icefall-zh-aishell3/model.onnx \
  --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
  --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
  --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
  --sid=33 \
  --output-filename=./liliana-33.wav \
  "林美丽最美丽、最漂亮、最可爱!"

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-icefall-zh-aishell3/model.onnx \
  --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
  --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
  --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
  --sid=99 \
  --output-filename=./liliana-99.wav \
  "林美丽最美丽、最漂亮、最可爱!"

It will generate 3 files: liliana-10.wav, liliana-33.wav, and liliana-99.wav.
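
Because the three commands differ only in --sid and --output-filename, they can be generated in a loop. The sketch below builds the argument lists in Python and checks that each speaker ID lies in the valid range 0 to 173. It only prints the commands, and the helper name tts_command is hypothetical; passing one of the returned lists to subprocess.run would execute it.

```python
import shlex

NUM_SPEAKERS = 174  # valid speaker IDs are 0..173
MODEL_DIR = "./vits-icefall-zh-aishell3"


def tts_command(sid, text, prefix="liliana"):
    """Build the sherpa-onnx-offline-tts argument list for one speaker ID."""
    if not 0 <= sid < NUM_SPEAKERS:
        raise ValueError(f"sid must be in [0, {NUM_SPEAKERS - 1}], got {sid}")
    # --tts-rule-fsts takes a comma-separated list of FST files.
    rule_fsts = ",".join(
        f"{MODEL_DIR}/{name}" for name in ("phone.fst", "date.fst", "number.fst")
    )
    return [
        "./build/bin/sherpa-onnx-offline-tts",
        f"--vits-model={MODEL_DIR}/model.onnx",
        f"--vits-lexicon={MODEL_DIR}/lexicon.txt",
        f"--vits-tokens={MODEL_DIR}/tokens.txt",
        f"--tts-rule-fsts={rule_fsts}",
        f"--sid={sid}",
        f"--output-filename=./{prefix}-{sid}.wav",
        text,
    ]


for sid in (10, 33, 99):
    print(shlex.join(tts_command(sid, "林美丽最美丽、最漂亮、最可爱!")))
```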

We also support rule-based text normalization, which is implemented with OpenFst. Currently, only number normalization is supported.

Hint

We will support other normalization rules later.

The following is an example:

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-icefall-zh-aishell3/model.onnx \
  --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
  --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
  --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
  --sid=66 \
  --output-filename=./rule-66.wav \
  "35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。"

Wave filename     Text
liliana-10.wav    林美丽最美丽、最漂亮、最可爱!
liliana-33.wav    林美丽最美丽、最漂亮、最可爱!
liliana-99.wav    林美丽最美丽、最漂亮、最可爱!
rule-66.wav       35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。

Generate speech with Python script

We use speaker IDs 21, 41, and 45 below to generate audio for different transcripts.

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-icefall-zh-aishell3/model.onnx \
 --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
 --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
 --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
 --sid=21 \
 --output-filename=./liubei-21.wav \
 "勿以恶小而为之,勿以善小而不为。惟贤惟德,能服于人。"

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-icefall-zh-aishell3/model.onnx \
 --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
 --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
 --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
 --sid=41 \
 --output-filename=./demokelite-41.wav \
 "要留心,即使当你独自一人时,也不要说坏话或做坏事,而要学得在你自己面前比在别人面前更知耻。"

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-icefall-zh-aishell3/model.onnx \
 --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
 --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
 --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
 --sid=45 \
 --output-filename=./zhugeliang-45.wav \
 "夫君子之行,静以修身,俭以养德,非淡泊无以明志,非宁静无以致远。"

It will generate 3 files: liubei-21.wav, demokelite-41.wav, and zhugeliang-45.wav.

The Python script also supports rule-based text normalization.

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-icefall-zh-aishell3/model.onnx \
 --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
 --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
 --tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
 --sid=103 \
 --output-filename=./rule-103.wav \
 "根据第7次全国人口普查结果表明,我国总人口有1443497378人。普查登记的大陆31个省、自治区、直辖市和现役军人的人口共1411778724人。电话号码是110。手机号是13812345678"

Wave filename        Text
liubei-21.wav        勿以恶小而为之,勿以善小而不为。惟贤惟德,能服于人。
demokelite-41.wav    要留心,即使当你独自一人时,也不要说坏话或做坏事,而要学得在你自己面前比在别人面前更知耻。
zhugeliang-45.wav    夫君子之行,静以修身,俭以养德,非淡泊无以明志,非宁静无以致远。
rule-103.wav         根据第7次全国人口普查结果表明,我国总人口有1443497378人。普查登记的大陆31个省、自治区、直辖市和现役军人的人口共1411778724人。电话号码是110。手机号是13812345678

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-icefall-zh-aishell3/model.onnx \
   --vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
   --vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
   --sid=66  \
   "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与温柔."
done

The results are given below:

num_threads    1        2        3        4
RTF            0.365    0.220    0.171    0.156
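
These results can also be read as a thread-scaling measurement. The sketch below computes, from the RTF values reported above, the speedup relative to one thread and the parallel efficiency (speedup divided by thread count). The function and variable names are only for illustration.

```python
# RTF values for this model on Raspberry Pi 4 Model B Rev 1.5, keyed by thread count.
rtf_by_threads = {1: 0.365, 2: 0.220, 3: 0.171, 4: 0.156}


def thread_scaling(rtfs):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread RTF."""
    base = rtfs[1]
    result = {}
    for threads, value in sorted(rtfs.items()):
        speedup = base / value
        result[threads] = (speedup, speedup / threads)
    return result


for threads, (speedup, efficiency) in thread_scaling(rtf_by_threads).items():
    print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Efficiency drops as threads are added, so the gain from the fourth thread is smaller than the gain from the second.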

en_US-lessac-medium (English, single-speaker)

This model is converted from https://huggingface.co/rhasspy/piper-voices/tree/main/en/en_US/lessac/medium.

The dataset used to train the model is lessac_blizzard2013.

Hint

The model is from piper.

In the following, we describe how to download it and use it with sherpa-onnx.

Download the model

Please use the following commands to download it.

cd /path/to/sherpa-onnx

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-lessac-medium.tar.bz2
tar xf vits-piper-en_US-lessac-medium.tar.bz2

Hint

You can find many pre-trained models for over 40 languages at https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.

Generate speech with executables compiled from C++

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./liliana-piper-en_US-lessac-medium.wav \
  "liliana, the most beautiful and lovely assistant of our team!"

Hint

You can also use

cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts-play \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./liliana-piper-en_US-lessac-medium.wav \
  "liliana, the most beautiful and lovely assistant of our team!"

which plays the audio while it is being generated.

After running, it will generate a file liliana-piper-en_US-lessac-medium.wav in the current directory.

soxi ./liliana-piper-en_US-lessac-medium.wav

Input File     : './liliana-piper-en_US-lessac-medium.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.48 = 76800 samples ~ 261.224 CDDA sectors
File Size      : 154k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Wave filename                            Text
liliana-piper-en_US-lessac-medium.wav    liliana, the most beautiful and lovely assistant of our team!
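
The numbers reported by soxi are consistent with each other, which gives a quick sanity check for a generated file. The sketch below redoes the arithmetic in Python for the values shown above (76800 samples, 22050 Hz, 16-bit, mono). The 75 sectors per second used for the CDDA figure is the standard CD audio rate; the variable names are only for illustration.

```python
num_samples = 76800
sample_rate = 22050       # Hz
bits_per_sample = 16
channels = 1

duration = num_samples / sample_rate                        # seconds
bit_rate = sample_rate * bits_per_sample * channels         # bits per second
data_bytes = num_samples * channels * bits_per_sample // 8  # PCM payload, without the WAV header
cdda_sectors = duration * 75                                # CD audio has 75 sectors per second

print(f"duration:     {duration:.2f} s")
print(f"bit rate:     {bit_rate} bit/s")
print(f"PCM payload:  {data_bytes} bytes")
print(f"CDDA sectors: {cdda_sectors:.3f}")
```

The results agree with the 3.48 s duration, the 353k bit rate, the 154k file size (payload plus a small header), and the 261.224 CDDA sectors that soxi reports.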

Generate speech with Python script

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts.py \
 --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
 --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
 --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
 --output-filename=./armstrong-piper-en_US-lessac-medium.wav \
 "That's one small step for a man, a giant leap for mankind."

Hint

You can also use

cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts-play.py \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./armstrong-piper-en_US-lessac-medium.wav \
  "That's one small step for a man, a giant leap for mankind."

which plays the audio while it is being generated.

After running, it will generate a file armstrong-piper-en_US-lessac-medium.wav in the current directory.

soxi ./armstrong-piper-en_US-lessac-medium.wav

Input File     : './armstrong-piper-en_US-lessac-medium.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:03.74 = 82432 samples ~ 280.381 CDDA sectors
File Size      : 165k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

Wave filename                              Text
armstrong-piper-en_US-lessac-medium.wav    That's one small step for a man, a giant leap for mankind.

RTF on Raspberry Pi 4 Model B Rev 1.5

We use the following command to test the RTF of this model on Raspberry Pi 4 Model B Rev 1.5:

for t in 1 2 3 4; do
 ./build/bin/sherpa-onnx-offline-tts \
   --num-threads=$t \
   --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
   --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
   --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
   "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
done

The results are given below:

num_threads    1        2        3        4
RTF            0.774    0.482    0.390    0.357