vits
This page lists pre-trained vits models.
All models in a single table
The following table summarizes the information of all models in this page.
Note

Since there are more than 100 pre-trained models for over 40 languages, we don't list all of them on this page. Please find the complete list at
https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.

You can try all the models in the following Hugging Face space: https://huggingface.co/spaces/k2-fsa/text-to-speech.
Hint
You can find Android APKs for each model at the following page
| Model | Language | # Speakers | Dataset | Model filesize (MB) | Sample rate (Hz) |
|---|---|---|---|---|---|
| vits-zh-hf-fanchen-C | Chinese | 187 | N/A | 116 | 16000 |
| vits-zh-hf-fanchen-wnj | Chinese | 1 | N/A | 116 | 16000 |
| vits-zh-hf-theresa | Chinese | 804 | N/A | 117 | 22050 |
| vits-zh-hf-eula | Chinese | 804 | N/A | 117 | 22050 |
| vits-icefall-zh-aishell3 | Chinese | 174 | aishell3 | 116 | 8000 |
| vits-ljs | English (US) | 1 (Female) | LJ Speech | 109 | 22050 |
| vits-vctk | English | 109 | VCTK | 116 | 22050 |
| vits-piper-en_US-lessac-medium | English (US) | 1 (Male) | lessac_blizzard2013 | 61 | 22050 |
ljspeech (English, single-speaker)
This model is converted from pretrained_ljspeech.pth, which is trained by the vits author Jaehyeon Kim on the LJ Speech dataset. It supports only English and is a single-speaker model.
Note
If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-ljs.py
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-ljs.tar.bz2
tar xvf vits-ljs.tar.bz2
rm vits-ljs.tar.bz2
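The wget/tar/rm steps above can also be scripted. The following is a minimal Python sketch, assuming the release URL layout stays as shown; the helper names are ours, not part of sherpa-onnx:

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> Path:
    """Fetch a release tarball if it is not already present locally."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract(tarball: Path, out_dir: Path) -> Path:
    """Unpack <name>.tar.bz2 into out_dir and return the model directory."""
    with tarfile.open(tarball, "r:bz2") as tf:
        tf.extractall(out_dir)
    return Path(out_dir) / Path(tarball).name.replace(".tar.bz2", "")


# For example:
# url = "https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-ljs.tar.bz2"
# model_dir = extract(download(url, Path("vits-ljs.tar.bz2")), Path("."))
```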
Please check that the file sizes of the pre-trained models are correct; see the sizes of the *.onnx files below.
-rw-r--r-- 1 1001 127 109M Apr 22 02:38 vits-ljs/vits-ljs.onnx
Generate speech with executable compiled from C++
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-ljs/vits-ljs.onnx \
--vits-lexicon=./vits-ljs/lexicon.txt \
--vits-tokens=./vits-ljs/tokens.txt \
--output-filename=./liliana.wav \
'liliana, the most beautiful and lovely assistant of our team!'
After running, it will generate a file liliana.wav in the current directory.
soxi ./liliana.wav
Input File : './liliana.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:04.39 = 96768 samples ~ 329.143 CDDA sectors
File Size : 194k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
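The numbers soxi reports are internally consistent; a quick check in Python, with the values copied from the output above:

```python
# Values from the soxi output for liliana.wav
samples = 96768
sample_rate = 22050           # Hz, mono
bits_per_sample = 16

duration_s = samples / sample_rate           # ~4.39 s, matching 00:00:04.39
data_bytes = samples * bits_per_sample // 8  # raw 16-bit mono PCM payload, ~194k with header
bit_rate = sample_rate * bits_per_sample     # 352800 bit/s, shown as 353k

print(round(duration_s, 2), data_bytes, bit_rate)
```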
| Wave filename | Text |
|---|---|
| liliana.wav | liliana, the most beautiful and lovely assistant of our team! |
Generate speech with Python script
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-ljs/vits-ljs.onnx \
--vits-lexicon=./vits-ljs/lexicon.txt \
--vits-tokens=./vits-ljs/tokens.txt \
--output-filename=./armstrong.wav \
"That's one small step for a man, a giant leap for mankind."
After running, it will generate a file armstrong.wav in the current directory.
soxi ./armstrong.wav
Input File : './armstrong.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:04.81 = 105984 samples ~ 360.49 CDDA sectors
File Size : 212k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
| Wave filename | Text |
|---|---|
| armstrong.wav | That's one small step for a man, a giant leap for mankind. |
VCTK (English, multi-speaker, 109 speakers)
This model is converted from pretrained_vctk.pth, which is trained by the vits author Jaehyeon Kim on the VCTK dataset. It supports only English and is a multi-speaker model. It contains 109 speakers.
Note
If you are interested in how the model is converted, please see https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-vctk.py
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
cd /path/to/sherpa-onnx
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/vits-vctk
cd vits-vctk
git lfs pull --include "*.onnx"
Please check that the file sizes of the pre-trained models are correct; see the sizes of the *.onnx files below.
vits-vctk fangjun$ ls -lh *.onnx
-rw-r--r-- 1 fangjun staff 37M Oct 16 10:57 vits-vctk.int8.onnx
-rw-r--r-- 1 fangjun staff 116M Oct 16 10:57 vits-vctk.onnx
Generate speech with executable compiled from C++
Since there are 109 speakers available, we can choose a speaker ID from 0 to 108. The default speaker ID is 0.
We use speaker ID 0, 10, and 108 below to generate audio for the same text.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=0 \
--output-filename=./kennedy-0.wav \
'Ask not what your country can do for you; ask what you can do for your country.'
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=10 \
--output-filename=./kennedy-10.wav \
'Ask not what your country can do for you; ask what you can do for your country.'
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=108 \
--output-filename=./kennedy-108.wav \
'Ask not what your country can do for you; ask what you can do for your country.'
It will generate 3 files: kennedy-0.wav, kennedy-10.wav, and kennedy-108.wav.
| Wave filename | Text |
|---|---|
| kennedy-0.wav | Ask not what your country can do for you; ask what you can do for your country. |
| kennedy-10.wav | Ask not what your country can do for you; ask what you can do for your country. |
| kennedy-108.wav | Ask not what your country can do for you; ask what you can do for your country. |
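Speaker IDs outside the valid range are a common source of errors with multi-speaker models. A minimal sketch of the range check (the function name is ours, not part of sherpa-onnx):

```python
def check_sid(sid: int, num_speakers: int) -> int:
    """Validate a speaker ID for a multi-speaker model.

    Valid IDs are 0 .. num_speakers - 1, i.e. 0..108 for vits-vctk.
    """
    if not 0 <= sid < num_speakers:
        raise ValueError(
            f"sid {sid} is out of range; expected 0 <= sid < {num_speakers}"
        )
    return sid


check_sid(108, 109)  # ok: the last speaker of vits-vctk
```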
Generate speech with Python script
We use speaker ID 30, 66, and 99 below to generate audio for different transcripts.
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=30 \
--output-filename=./einstein-30.wav \
"Life is like riding a bicycle. To keep your balance, you must keep moving."
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=66 \
--output-filename=./franklin-66.wav \
"Three can keep a secret, if two of them are dead."
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-vctk/vits-vctk.onnx \
--vits-lexicon=./vits-vctk/lexicon.txt \
--vits-tokens=./vits-vctk/tokens.txt \
--sid=99 \
--output-filename=./martin-99.wav \
"Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that"
It will generate 3 files: einstein-30.wav, franklin-66.wav, and martin-99.wav.
| Wave filename | Text |
|---|---|
| einstein-30.wav | Life is like riding a bicycle. To keep your balance, you must keep moving. |
| franklin-66.wav | Three can keep a secret, if two of them are dead. |
| martin-99.wav | Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that |
csukuangfj/vits-zh-hf-fanchen-C (Chinese, 187 speakers)
You can download the model using the following commands:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-C.tar.bz2
tar xvf vits-zh-hf-fanchen-C.tar.bz2
rm vits-zh-hf-fanchen-C.tar.bz2
Hint
This model is converted from https://huggingface.co/spaces/lkz99/tts_model/tree/main/zh
# information about model files
total 291M
-rw-r--r-- 1 1001 127 58K Apr 21 05:40 date.fst
drwxr-xr-x 3 1001 127 4.0K Apr 19 12:42 dict
-rwxr-xr-x 1 1001 127 4.0K Apr 21 05:40 export-onnx-zh-hf-fanchen-models.py
-rwxr-xr-x 1 1001 127 2.5K Apr 21 05:40 generate-lexicon-zh-hf-fanchen-models.py
-rw-r--r-- 1 1001 127 2.4M Apr 21 05:40 lexicon.txt
-rw-r--r-- 1 1001 127 22K Apr 21 05:40 new_heteronym.fst
-rw-r--r-- 1 1001 127 63K Apr 21 05:40 number.fst
-rw-r--r-- 1 1001 127 87K Apr 21 05:40 phone.fst
-rw-r--r-- 1 1001 127 173M Apr 21 05:40 rule.far
-rw-r--r-- 1 1001 127 331 Apr 21 05:40 tokens.txt
-rw-r--r-- 1 1001 127 116M Apr 21 05:40 vits-zh-hf-fanchen-C.onnx
-rwxr-xr-x 1 1001 127 2.0K Apr 21 05:40 vits-zh-hf-fanchen-models.sh
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=0.5 \
--output-filename="./value-2x.wav" \
"小米的核心价值观是什么?答案是真诚热爱!"
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=1.0 \
--tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
--output-filename="./numbers.wav" \
"小米有14岁了"
sherpa-onnx-offline-tts \
--sid=100 \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=1.0 \
--tts-rule-fsts=./vits-zh-hf-fanchen-C/phone.fst,./vits-zh-hf-fanchen-C/number.fst \
--output-filename="./numbers-100.wav" \
"有困难,请拨打110 或者18601200909"
sherpa-onnx-offline-tts \
--sid=14 \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--vits-length-scale=1.0 \
--output-filename="./wo-mi-14.wav" \
"小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。"
sherpa-onnx-offline-tts \
--sid=102 \
--vits-model=./vits-zh-hf-fanchen-C/vits-zh-hf-fanchen-C.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-C/dict \
--vits-lexicon=./vits-zh-hf-fanchen-C/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-C/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-fanchen-C/number.fst \
--vits-length-scale=1.0 \
--output-filename="./heteronym-102.wav" \
"35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。"
| Wave filename | Text |
|---|---|
| value-2x.wav | 小米的核心价值观是什么?答案是真诚热爱! |
| numbers.wav | 小米有14岁了 |
| numbers-100.wav | 有困难,请拨打110 或者18601200909 |
| wo-mi-14.wav | 小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。 |
| heteronym-102.wav | 35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。 |
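value-2x.wav above is generated with --vits-length-scale=0.5. In VITS the length scale multiplies the predicted phoneme durations, so the output duration scales roughly linearly with it; a sketch of the relationship:

```python
def scaled_duration(base_duration_s: float, length_scale: float) -> float:
    """Approximate output duration for a given --vits-length-scale.

    length_scale < 1 speeds speech up (0.5 is roughly twice as fast);
    length_scale > 1 slows it down.
    """
    return base_duration_s * length_scale


# A 4-second utterance rendered at length scale 0.5 lasts about 2 seconds:
scaled_duration(4.0, 0.5)
```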
csukuangfj/vits-zh-hf-fanchen-wnj (Chinese, 1 male)
You can download the model using the following commands:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-fanchen-wnj.tar.bz2
tar xvf vits-zh-hf-fanchen-wnj.tar.bz2
rm vits-zh-hf-fanchen-wnj.tar.bz2
Hint
This model is converted from https://huggingface.co/spaces/lkz99/tts_model/blob/main/G_wnj_latest.pth
# information about model files
total 594760
-rw-r--r-- 1 fangjun staff 58K Apr 21 13:40 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rwxr-xr-x 1 fangjun staff 3.9K Apr 21 13:40 export-onnx-zh-hf-fanchen-models.py
-rwxr-xr-x 1 fangjun staff 2.4K Apr 21 13:40 generate-lexicon-zh-hf-fanchen-models.py
-rw-r--r-- 1 fangjun staff 2.3M Apr 21 13:40 lexicon.txt
-rw-r--r-- 1 fangjun staff 21K Apr 21 13:40 new_heteronym.fst
-rw-r--r-- 1 fangjun staff 63K Apr 21 13:40 number.fst
-rw-r--r-- 1 fangjun staff 87K Apr 21 13:40 phone.fst
-rw-r--r-- 1 fangjun staff 172M Apr 21 13:40 rule.far
-rw-r--r-- 1 fangjun staff 331B Apr 21 13:40 tokens.txt
-rwxr-xr-x 1 fangjun staff 1.9K Apr 21 13:40 vits-zh-hf-fanchen-models.sh
-rw-r--r-- 1 fangjun staff 115M Apr 21 13:40 vits-zh-hf-fanchen-wnj.onnx
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
--vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
--output-filename="./kuayue.wav" \
"升级人车家全生态,小米迎跨越时刻。"
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-fanchen-wnj/vits-zh-hf-fanchen-wnj.onnx \
--vits-dict-dir=./vits-zh-hf-fanchen-wnj/dict \
--vits-lexicon=./vits-zh-hf-fanchen-wnj/lexicon.txt \
--vits-tokens=./vits-zh-hf-fanchen-wnj/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-fanchen-wnj/number.fst \
--output-filename="./os.wav" \
"这一全新操作系统,是小米14年来技术积淀的结晶。"
| Wave filename | Text |
|---|---|
| kuayue.wav | 升级人车家全生态,小米迎跨越时刻。 |
| os.wav | 这一全新操作系统,是小米14年来技术积淀的结晶。 |
csukuangfj/vits-zh-hf-theresa (Chinese, 804 speakers)
You can download the model with the following commands:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-theresa.tar.bz2
tar xvf vits-zh-hf-theresa.tar.bz2
rm vits-zh-hf-theresa.tar.bz2
Hint
This model is converted from https://huggingface.co/spaces/zomehwh/vits-models-genshin-bh3/tree/main/pretrained_models/theresa
# information about model files
total 596992
-rw-r--r-- 1 fangjun staff 58K Apr 21 13:39 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rw-r--r-- 1 fangjun staff 2.6M Apr 21 13:39 lexicon.txt
-rw-r--r-- 1 fangjun staff 21K Apr 21 13:39 new_heteronym.fst
-rw-r--r-- 1 fangjun staff 63K Apr 21 13:39 number.fst
-rw-r--r-- 1 fangjun staff 87K Apr 21 13:39 phone.fst
-rw-r--r-- 1 fangjun staff 172M Apr 21 13:39 rule.far
-rw-r--r-- 1 fangjun staff 116M Apr 21 13:39 theresa.onnx
-rw-r--r-- 1 fangjun staff 268B Apr 21 13:39 tokens.txt
-rwxr-xr-x 1 fangjun staff 5.3K Apr 21 13:39 vits-zh-hf-models.py
-rwxr-xr-x 1 fangjun staff 571B Apr 21 13:39 vits-zh-hf-models.sh
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-theresa/theresa.onnx \
--vits-dict-dir=./vits-zh-hf-theresa/dict \
--vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
--vits-tokens=./vits-zh-hf-theresa/tokens.txt \
--sid=0 \
--output-filename="./reai-0.wav" \
"真诚就是不欺人也不自欺。热爱就是全心投入并享受其中。"
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-theresa/theresa.onnx \
--vits-dict-dir=./vits-zh-hf-theresa/dict \
--vits-lexicon=./vits-zh-hf-theresa/lexicon.txt \
--vits-tokens=./vits-zh-hf-theresa/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-theresa/number.fst \
--debug=1 \
--sid=88 \
--output-filename="./mi14-88.wav" \
"小米14一周销量破1000000!"
| Wave filename | Text |
|---|---|
| reai-0.wav | 真诚就是不欺人也不自欺。热爱就是全心投入并享受其中。 |
| mi14-88.wav | 小米14一周销量破1000000! |
csukuangfj/vits-zh-hf-eula (Chinese, 804 speakers)
You can download the model using the following commands:
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-eula.tar.bz2
tar xvf vits-zh-hf-eula.tar.bz2
rm vits-zh-hf-eula.tar.bz2
Hint
This model is converted from https://huggingface.co/spaces/zomehwh/vits-models-genshin-bh3/tree/main/pretrained_models/eula
# information about model files
total 596992
-rw-r--r-- 1 fangjun staff 58K Apr 21 13:39 date.fst
drwxr-xr-x 9 fangjun staff 288B Apr 19 20:42 dict
-rw-r--r-- 1 fangjun staff 116M Apr 21 13:39 eula.onnx
-rw-r--r-- 1 fangjun staff 2.6M Apr 21 13:39 lexicon.txt
-rw-r--r-- 1 fangjun staff 21K Apr 21 13:39 new_heteronym.fst
-rw-r--r-- 1 fangjun staff 63K Apr 21 13:39 number.fst
-rw-r--r-- 1 fangjun staff 87K Apr 21 13:39 phone.fst
-rw-r--r-- 1 fangjun staff 172M Apr 21 13:39 rule.far
-rw-r--r-- 1 fangjun staff 268B Apr 21 13:39 tokens.txt
-rwxr-xr-x 1 fangjun staff 5.3K Apr 21 13:39 vits-zh-hf-models.py
-rwxr-xr-x 1 fangjun staff 571B Apr 21 13:39 vits-zh-hf-models.sh
usage:
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-eula/eula.onnx \
--vits-dict-dir=./vits-zh-hf-eula/dict \
--vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
--vits-tokens=./vits-zh-hf-eula/tokens.txt \
--debug=1 \
--sid=666 \
--output-filename="./news-666.wav" \
"小米在今天上午举办的核心干部大会上,公布了新十年的奋斗目标和科技战略,并发布了小米价值观的八条诠释。"
sherpa-onnx-offline-tts \
--vits-model=./vits-zh-hf-eula/eula.onnx \
--vits-dict-dir=./vits-zh-hf-eula/dict \
--vits-lexicon=./vits-zh-hf-eula/lexicon.txt \
--vits-tokens=./vits-zh-hf-eula/tokens.txt \
--tts-rule-fsts=./vits-zh-hf-eula/number.fst \
--sid=99 \
--output-filename="./news-99.wav" \
"9月25日消息,雷军今日在微博发文称"
| Wave filename | Text |
|---|---|
| news-666.wav | 小米在今天上午举办的核心干部大会上,公布了新十年的奋斗目标和科技战略,并发布了小米价值观的八条诠释。 |
| news-99.wav | 9月25日消息,雷军今日在微博发文称 |
aishell3 (Chinese, multi-speaker, 174 speakers)
This model is trained on the aishell3 dataset using icefall.
It supports only Chinese and is a multi-speaker model with 174 speakers.
Hint
You can download the Android APK for this model at
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
(Please search for vits-icefall-zh-aishell3 in the above Android APK page.)
Note
If you are interested in how the model is converted, please see the documentation of icefall.
If you are interested in training your own model, please also refer to icefall.
icefall is also developed by us.
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-icefall-zh-aishell3.tar.bz2
tar xvf vits-icefall-zh-aishell3.tar.bz2
rm vits-icefall-zh-aishell3.tar.bz2
Please check that the file sizes of the pre-trained models are correct; see the sizes of the *.onnx files below.
vits-icefall-zh-aishell3 fangjun$ ls -lh *.onnx
-rw-r--r-- 1 fangjun staff 29M Mar 20 22:50 model.onnx
Generate speech with executable compiled from C++
Since there are 174 speakers available, we can choose a speaker from 0 to 173. The default speaker ID is 0.
We use speaker ID 10, 33, and 99 below to generate audio for the same text.
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=10 \
--output-filename=./liliana-10.wav \
"林美丽最美丽、最漂亮、最可爱!"
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=33 \
--output-filename=./liliana-33.wav \
"林美丽最美丽、最漂亮、最可爱!"
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=99 \
--output-filename=./liliana-99.wav \
"林美丽最美丽、最漂亮、最可爱!"
It will generate 3 files: liliana-10.wav, liliana-33.wav, and liliana-99.wav.
We also support rule-based text normalization, which is implemented with OpenFst. Currently, only number normalization is supported.
Hint
We will support other normalization rules later.
The following is an example:
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=66 \
--output-filename=./rule-66.wav \
"35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。"
| Wave filename | Text |
|---|---|
| liliana-10.wav | 林美丽最美丽、最漂亮、最可爱! |
| liliana-33.wav | 林美丽最美丽、最漂亮、最可爱! |
| liliana-99.wav | 林美丽最美丽、最漂亮、最可爱! |
| rule-66.wav | 35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。1天前莅临我行指导工作。 |
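The rule FSTs rewrite digits into words before the text reaches the lexicon. As a toy illustration only (the real rules live in number.fst and cover far more cases), here is how small cardinals could be expanded into Chinese:

```python
DIGITS = "零一二三四五六七八九"


def cardinal_to_zh(n: int) -> str:
    """Verbalize 0-99 as a Chinese cardinal, e.g. 14 -> 十四, 35 -> 三十五."""
    if not 0 <= n < 100:
        raise ValueError("this toy example only handles 0-99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    text = ("" if tens == 1 else DIGITS[tens]) + "十"
    return text + (DIGITS[ones] if ones else "")


cardinal_to_zh(14)  # 十四, as in "小米有14岁了"
```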
Generate speech with Python script
We use speaker ID 21, 41, and 45 below to generate audio for different transcripts.
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=21 \
--output-filename=./liubei-21.wav \
"勿以恶小而为之,勿以善小而不为。惟贤惟德,能服于人。"
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=41 \
--output-filename=./demokelite-41.wav \
"要留心,即使当你独自一人时,也不要说坏话或做坏事,而要学得在你自己面前比在别人面前更知耻。"
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=45 \
--output-filename=./zhugeliang-45.wav \
"夫君子之行,静以修身,俭以养德,非淡泊无以明志,非宁静无以致远。"
It will generate 3 files: liubei-21.wav, demokelite-41.wav, and zhugeliang-45.wav.
The Python script also supports rule-based text normalization.
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-icefall-zh-aishell3/model.onnx \
--vits-lexicon=./vits-icefall-zh-aishell3/lexicon.txt \
--vits-tokens=./vits-icefall-zh-aishell3/tokens.txt \
--tts-rule-fsts=./vits-icefall-zh-aishell3/phone.fst,./vits-icefall-zh-aishell3/date.fst,./vits-icefall-zh-aishell3/number.fst \
--sid=103 \
--output-filename=./rule-103.wav \
"根据第7次全国人口普查结果表明,我国总人口有1443497378人。普查登记的大陆31个省、自治区、直辖市和现役军人的人口共1411778724人。电话号码是110。手机号是13812345678"
| Wave filename | Text |
|---|---|
| liubei-21.wav | 勿以恶小而为之,勿以善小而不为。惟贤惟德,能服于人。 |
| demokelite-41.wav | 要留心,即使当你独自一人时,也不要说坏话或做坏事,而要学得在你自己面前比在别人面前更知耻。 |
| zhugeliang-45.wav | 夫君子之行,静以修身,俭以养德,非淡泊无以明志,非宁静无以致远。 |
| rule-103.wav | 根据第7次全国人口普查结果表明,我国总人口有1443497378人。普查登记的大陆31个省、自治区、直辖市和现役军人的人口共1411778724人。电话号码是110。手机号是13812345678 |
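Phone numbers such as 110 and 13812345678 are conventionally read digit by digit, with 1 pronounced 幺 (yāo). A toy sketch of that convention (again, the actual rewriting is done by the rule FSTs, not by this code):

```python
PHONE_DIGITS = "零幺二三四五六七八九"  # in phone numbers, 1 is read as 幺


def phone_to_zh(number: str) -> str:
    """Verbalize a phone number digit by digit, e.g. '110' -> '幺幺零'."""
    return "".join(PHONE_DIGITS[int(d)] for d in number)


phone_to_zh("110")  # 幺幺零
```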
en_US-lessac-medium (English, single-speaker)
This model is converted from https://huggingface.co/rhasspy/piper-voices/tree/main/en/en_US/lessac/medium.
The dataset used to train the model is lessac_blizzard2013.
Hint
The model is from piper.
In the following, we describe how to download it and use it with sherpa-onnx.
Download the model
Please use the following commands to download it.
cd /path/to/sherpa-onnx
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-lessac-medium.tar.bz2
tar xf vits-piper-en_US-lessac-medium.tar.bz2
Hint
You can find a lot of pre-trained models for over 40 languages at <https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models>.
Generate speech with executable compiled from C++
cd /path/to/sherpa-onnx
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./liliana-piper-en_US-lessac-medium.wav \
'liliana, the most beautiful and lovely assistant of our team!'
Hint
You can also use
cd /path/to/sherpa-onnx

./build/bin/sherpa-onnx-offline-tts-play \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./liliana-piper-en_US-lessac-medium.wav \
  'liliana, the most beautiful and lovely assistant of our team!'

which will play the audio while it is being generated.
After running, it will generate a file liliana-piper-en_US-lessac-medium.wav in the current directory.
soxi ./liliana-piper-en_US-lessac-medium.wav
Input File : './liliana-piper-en_US-lessac-medium.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:03.48 = 76800 samples ~ 261.224 CDDA sectors
File Size : 154k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
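As the soxi output shows, the generated files are plain mono 16-bit PCM WAV. A self-contained sketch that writes and inspects such a file with Python's stdlib wave module (the sine tone is placeholder audio, not TTS output):

```python
import math
import struct
import wave


def write_tone(path: str, sample_rate: int = 22050, seconds: float = 0.1) -> None:
    """Write a mono 16-bit PCM WAV containing a 440 Hz sine tone."""
    n = int(sample_rate * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / sample_rate))
        for i in range(n)
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono, like the TTS output
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(b"".join(struct.pack("<h", s) for s in samples))


write_tone("tone.wav")
with wave.open("tone.wav", "rb") as f:
    print(f.getnchannels(), f.getframerate(), f.getnframes())
# prints: 1 22050 2205
```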
| Wave filename | Text |
|---|---|
| liliana-piper-en_US-lessac-medium.wav | liliana, the most beautiful and lovely assistant of our team! |
Generate speech with Python script
cd /path/to/sherpa-onnx
python3 ./python-api-examples/offline-tts.py \
--vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
--vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
--vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
--output-filename=./armstrong-piper-en_US-lessac-medium.wav \
"That's one small step for a man, a giant leap for mankind."
Hint
You can also use
cd /path/to/sherpa-onnx

python3 ./python-api-examples/offline-tts-play.py \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./armstrong-piper-en_US-lessac-medium.wav \
  "That's one small step for a man, a giant leap for mankind."

which will play the audio while it is being generated.
After running, it will generate a file armstrong-piper-en_US-lessac-medium.wav in the current directory.
soxi ./armstrong-piper-en_US-lessac-medium.wav
Input File : './armstrong-piper-en_US-lessac-medium.wav'
Channels : 1
Sample Rate : 22050
Precision : 16-bit
Duration : 00:00:03.74 = 82432 samples ~ 280.381 CDDA sectors
File Size : 165k
Bit Rate : 353k
Sample Encoding: 16-bit Signed Integer PCM
| Wave filename | Text |
|---|---|
| armstrong-piper-en_US-lessac-medium.wav | That's one small step for a man, a giant leap for mankind. |