VITS-LJSpeech
This tutorial shows you how to train a VITS model with the LJSpeech dataset.
Note
TTS-related recipes require the packages listed in requirements-tts.txt.
Note
The VITS paper: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Install extra dependencies
pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
pip install numba espnet_tts_frontend
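After installing, you can confirm the extra dependencies are present with a short Python check. This is a minimal sketch; it assumes the installed distribution names match the pip names above.
# a quick sanity check (sketch): confirm the extra dependencies are installed.
# The distribution names are assumed to match the pip package names above.
from importlib.metadata import version

for pkg in ("piper_phonemize", "numba", "espnet_tts_frontend"):
    print(f"{pkg}: {version(pkg)}")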
Data preparation
$ cd egs/ljspeech/TTS
$ ./prepare.sh
To run stages 1 through 5, use:
$ ./prepare.sh --stage 1 --stop_stage 5
Build Monotonic Alignment Search
$ ./prepare.sh --stage -1 --stop_stage -1
or
$ cd vits/monotonic_align
$ python setup.py build_ext --inplace
$ cd ../../
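To verify that the in-place build succeeded, you can look for the compiled extension. The sketch below assumes you run it from egs/ljspeech/TTS and that the build leaves a shared object (*.so) somewhere under vits/monotonic_align; the exact file name depends on your Python version and platform.
# a quick check (sketch), run from egs/ljspeech/TTS: the in-place build above
# should leave a compiled extension (a *.so file) under vits/monotonic_align/.
from pathlib import Path

built = list(Path("vits/monotonic_align").rglob("*.so"))
print("compiled extensions:", [str(p) for p in built])
assert built, "no compiled extension found; re-run the build step above"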
Training
$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./vits/train.py \
--world-size 4 \
--num-epochs 1000 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--model-type high \
--max-duration 500
Note
You can adjust the hyper-parameters to control the size of the VITS model and
the training configuration. For more details, please run ./vits/train.py --help.
Warning
If you want a model that runs faster on CPU, please use --model-type low or --model-type medium.
Note
The training can take a long time (usually a couple of days).
Training logs, checkpoints, and TensorBoard logs are saved in vits/exp.
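If you want to peek at a saved checkpoint, a sketch like the following works. It assumes checkpoints are standard PyTorch files named like epoch-N.pt under vits/exp; the file name and the stored keys are assumptions, so adjust them to what you actually find there.
# a sketch for inspecting a training checkpoint; the file name epoch-1000.pt
# and the stored keys are assumptions -- adjust to what you find in vits/exp.
import torch

ckpt = torch.load("vits/exp/epoch-1000.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))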
Inference
The inference stage uses checkpoints saved during training, so you have to run the training stage first. It saves the ground-truth and generated wavs to the directory vits/exp/infer/epoch-*/wav, e.g., vits/exp/infer/epoch-1000/wav.
$ export CUDA_VISIBLE_DEVICES="0"
$ ./vits/infer.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--max-duration 500
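Once inference finishes, you can inspect one of the generated wavs with a short script like this. It is a sketch: it simply picks the first wav file found in vits/exp/infer/epoch-1000/wav, so adjust the path if you used a different epoch.
# a sketch for inspecting one of the generated wavs; adjust the directory
# if you ran inference with a different --epoch value.
from pathlib import Path

import soundfile as sf

wav_dir = Path("vits/exp/infer/epoch-1000/wav")
wav_path = sorted(wav_dir.glob("*.wav"))[0]
samples, sample_rate = sf.read(str(wav_path))
print(wav_path, "->", len(samples) / sample_rate, "seconds at", sample_rate, "Hz")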
Note
For more details, please run ./vits/infer.py --help.
Export models
Currently we only support exporting to ONNX. It generates one file in the given exp-dir: vits-epoch-*.onnx.
$ ./vits/export-onnx.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt
You can test the exported ONNX model with:
$ ./vits/test_onnx.py \
--model-filename vits/exp/vits-epoch-1000.onnx \
--tokens data/tokens.txt
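Besides test_onnx.py, you can also inspect the exported model directly with onnxruntime, e.g. to check its input/output names and the metadata written at export time. This is a sketch; the model file name matches the export command above.
# a sketch for inspecting the exported model with onnxruntime
import onnxruntime as ort

sess = ort.InferenceSession(
    "vits/exp/vits-epoch-1000.onnx", providers=["CPUExecutionProvider"]
)
print("inputs  :", [(i.name, i.shape) for i in sess.get_inputs()])
print("outputs :", [(o.name, o.shape) for o in sess.get_outputs()])
print("metadata:", sess.get_modelmeta().custom_metadata_map)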
Download pretrained models
If you don’t want to train from scratch, you can download pretrained models from the following links:
--model-type=high: https://huggingface.co/Zengwei/icefall-tts-ljspeech-vits-2024-02-28
--model-type=medium: https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-medium-2024-03-12
--model-type=low: https://huggingface.co/csukuangfj/icefall-tts-ljspeech-vits-low-2024-03-12
Usage in sherpa-onnx
The following describes how to test the exported ONNX model in sherpa-onnx.
Hint
sherpa-onnx supports different programming languages, e.g., C++, C, Python, Kotlin, Java, Swift, Go, C#, etc. It also supports Android and iOS.
We only describe how to use pre-built binaries from sherpa-onnx below. Please refer to https://k2-fsa.github.io/sherpa/onnx/ for more documentation.
Install sherpa-onnx
pip install sherpa-onnx
To check that you have installed sherpa-onnx successfully, please run:
which sherpa-onnx-offline-tts
sherpa-onnx-offline-tts --help
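You can also check the Python package from within Python. This is a minimal sketch; it only confirms the package imports and, if available, prints its version attribute.
# a quick check (sketch) that the sherpa-onnx Python package is importable
import sherpa_onnx

print(sherpa_onnx.__file__)
print(getattr(sherpa_onnx, "__version__", "version attribute not found"))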
Download lexicon files
cd /tmp
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/espeak-ng-data.tar.bz2
tar xf espeak-ng-data.tar.bz2
Run sherpa-onnx
cd egs/ljspeech/TTS
sherpa-onnx-offline-tts \
--vits-model=vits/exp/vits-epoch-1000.onnx \
--vits-tokens=data/tokens.txt \
--vits-data-dir=/tmp/espeak-ng-data \
--num-threads=1 \
--output-filename=./high.wav \
"Ask not what your country can do for you; ask what you can do for your country."
Hint
You can also use sherpa-onnx-offline-tts-play to play the audio as it is being generated.
You should get a file high.wav after running the above command.
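If you prefer Python over the command-line tool, the following sketch performs the same synthesis with the sherpa-onnx Python API using the same model files. The config class and field names follow the sherpa-onnx Python examples; double-check them against your installed version.
# a sketch of the same synthesis via the sherpa-onnx Python API; class and
# field names follow the sherpa-onnx Python examples and may need adjusting
# for your installed version.
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="vits/exp/vits-epoch-1000.onnx",
            lexicon="",
            tokens="data/tokens.txt",
            data_dir="/tmp/espeak-ng-data",
        ),
        num_threads=1,
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate(
    "Ask not what your country can do for you; "
    "ask what you can do for your country."
)
sf.write("high.wav", audio.samples, samplerate=audio.sample_rate)
print("Saved high.wav:", len(audio.samples) / audio.sample_rate, "seconds")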
Congratulations! You have successfully trained and exported a text-to-speech model and run it with sherpa-onnx.