VITS-LJSpeech
This tutorial shows you how to train a VITS model with the LJSpeech dataset.
Note
TTS-related recipes require the packages listed in requirements-tts.txt.
Note
The VITS paper: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Install extra dependencies
pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
pip install numba espnet_tts_frontend
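After installing, you can confirm the extra dependencies are present with a short Python check. This is a minimal sketch; it assumes the installed distribution names match the pip names above.
# a quick sanity check (sketch): confirm the extra dependencies are installed.
# The distribution names are assumed to match the pip package names above.
from importlib.metadata import version

for pkg in ("piper_phonemize", "numba", "espnet_tts_frontend"):
    print(f"{pkg}: {version(pkg)}")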
Data preparation
$ cd egs/ljspeech/TTS
$ ./prepare.sh
To run stages 1 through 5, use:
$ ./prepare.sh --stage 1 --stop_stage 5
Build Monotonic Alignment Search
$ ./prepare.sh --stage -1 --stop_stage -1
or
$ cd vits/monotonic_align
$ python setup.py build_ext --inplace
$ cd ../../
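To verify that the in-place build succeeded, you can look for the compiled extension. The sketch below assumes you run it from egs/ljspeech/TTS and that the build leaves a shared object (*.so) somewhere under vits/monotonic_align; the exact file name depends on your Python version and platform.
# a quick check (sketch), run from egs/ljspeech/TTS: the in-place build above
# should leave a compiled extension (a *.so file) under vits/monotonic_align/.
from pathlib import Path

built = list(Path("vits/monotonic_align").rglob("*.so"))
print("compiled extensions:", [str(p) for p in built])
assert built, "no compiled extension found; re-run the build step above"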
Training
$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./vits/train.py \
--world-size 4 \
--num-epochs 1000 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--model-type high \
--max-duration 500
Note
You can adjust the hyper-parameters to control the size of the VITS model and
the training configuration. For more details, please run ./vits/train.py --help.
Warning
If you want a model that runs faster on CPU, please use --model-type low or --model-type medium.
Note
The training can take a long time (usually a couple of days).
Training logs, checkpoints, and TensorBoard logs are saved in vits/exp.
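If you want to peek at a saved checkpoint, a sketch like the following works. It assumes checkpoints are standard PyTorch files named like epoch-N.pt under vits/exp; the file name and the stored keys are assumptions, so adjust them to what you actually find there.
# a sketch for inspecting a training checkpoint; the file name epoch-1000.pt
# and the stored keys are assumptions -- adjust to what you find in vits/exp.
import torch

ckpt = torch.load("vits/exp/epoch-1000.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))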
Inference
The inference stage uses checkpoints saved during training, so you have to run the training stage first. It saves the ground-truth and generated wavs to the directory vits/exp/infer/epoch-*/wav, e.g., vits/exp/infer/epoch-1000/wav.
$ export CUDA_VISIBLE_DEVICES="0"
$ ./vits/infer.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--max-duration 500
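Once inference finishes, you can inspect one of the generated wavs with a short script like this. It is a sketch: it simply picks the first wav file found in vits/exp/infer/epoch-1000/wav, so adjust the path if you used a different epoch.
# a sketch for inspecting one of the generated wavs; adjust the directory
# if you ran inference with a different --epoch value.
from pathlib import Path

import soundfile as sf

wav_dir = Path("vits/exp/infer/epoch-1000/wav")
wav_path = sorted(wav_dir.glob("*.wav"))[0]
samples, sample_rate = sf.read(str(wav_path))
print(wav_path, "->", len(samples) / sample_rate, "seconds at", sample_rate, "Hz")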
Note
For more details, please run ./vits/infer.py --help.
Export models
Currently we only support exporting to ONNX. It generates one file in the given exp-dir: vits-epoch-*.onnx.
$ ./vits/export-onnx.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt
You can test the exported ONNX model with:
$ ./vits/test_onnx.py \
--model-filename vits/exp/vits-epoch-1000.onnx \
--tokens data/tokens.txt
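Besides test_onnx.py, you can also inspect the exported model directly with onnxruntime, e.g. to check its input/output names and the metadata written at export time. This is a sketch; the model file name matches the export command above.
# a sketch for inspecting the exported model with onnxruntime
import onnxruntime as ort

sess = ort.InferenceSession(
    "vits/exp/vits-epoch-1000.onnx", providers=["CPUExecutionProvider"]
)
print("inputs  :", [(i.name, i.shape) for i in sess.get_inputs()])
print("outputs :", [(o.name, o.shape) for o in sess.get_outputs()])
print("metadata:", sess.get_modelmeta().custom_metadata_map)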
Download pretrained models
If you don’t want to train from scratch, you can download pretrained models from the following links:
--model-type=high: https://huggingface.co/Zengwei/icefall-tts-ljspeech-vits-2024-02-28
--model-type=medium: https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-medium-2024-03-12
--model-type=low: https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-low-2024-03-12
Usage in sherpa-onnx
The following describes how to test the exported ONNX model in sherpa-onnx.
Hint
sherpa-onnx supports different programming languages, e.g., C++, C, Python, Kotlin, Java, Swift, Go, C#, etc. It also supports Android and iOS.
We only describe how to use pre-built binaries from sherpa-onnx below. Please refer to https://k2-fsa.github.io/sherpa/onnx/ for more documentation.
Install sherpa-onnx
pip install sherpa-onnx
To check that you have installed sherpa-onnx successfully, please run:
which sherpa-onnx-offline-tts
sherpa-onnx-offline-tts --help
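You can also check the Python package from within Python. This is a minimal sketch; it only confirms the package imports and, if available, prints its version attribute.
# a quick check (sketch) that the sherpa-onnx Python package is importable
import sherpa_onnx

print(sherpa_onnx.__file__)
print(getattr(sherpa_onnx, "__version__", "version attribute not found"))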
Download lexicon files
cd /tmp
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/espeak-ng-data.tar.bz2
tar xf espeak-ng-data.tar.bz2
Run sherpa-onnx
cd egs/ljspeech/TTS
sherpa-onnx-offline-tts \
--vits-model=vits/exp/vits-epoch-1000.onnx \
--vits-tokens=data/tokens.txt \
--vits-data-dir=/tmp/espeak-ng-data \
--num-threads=1 \
--output-filename=./high.wav \
"Ask not what your country can do for you; ask what you can do for your country."
Hint
You can also use sherpa-onnx-offline-tts-play to play the audio as it is being generated.
You should get a file high.wav after running the above command.
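If you prefer Python over the command-line tool, the following sketch performs the same synthesis with the sherpa-onnx Python API using the same model files. The config class and field names follow the sherpa-onnx Python examples; double-check them against your installed version.
# a sketch of the same synthesis via the sherpa-onnx Python API; class and
# field names follow the sherpa-onnx Python examples and may need adjusting
# for your installed version.
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="vits/exp/vits-epoch-1000.onnx",
            lexicon="",
            tokens="data/tokens.txt",
            data_dir="/tmp/espeak-ng-data",
        ),
        num_threads=1,
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate(
    "Ask not what your country can do for you; "
    "ask what you can do for your country."
)
sf.write("high.wav", audio.samples, samplerate=audio.sample_rate)
print("Saved high.wav:", len(audio.samples) / audio.sample_rate, "seconds")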
Congratulations! You have successfully trained and exported a text-to-speech model and run it with sherpa-onnx.