MMS
This section describes how to convert models from https://huggingface.co/facebook/mms-tts/tree/main to sherpa-onnx.
Note that facebook/mms-tts supports more than 1000 languages. You can try models from facebook/mms-tts at the huggingface space https://huggingface.co/spaces/mms-meta/MMS.
You can try the converted models by visiting https://huggingface.co/spaces/k2-fsa/text-to-speech.
To download the converted models, please visit https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models.
If a filename contains vits-mms
, it means the model is from
facebook/mms-tts.
Install dependencies
pip install -qq onnx scipy Cython
pip install -qq torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
Download the model file
Suppose that we want to convert the English model, we need to use the following commands to download the model:
name=eng
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/G_100000.pth
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/config.json
wget -q https://huggingface.co/facebook/mms-tts/resolve/main/models/$name/vocab.txt
Download MMS source code
git clone https://huggingface.co/spaces/mms-meta/MMS
export PYTHONPATH=$PWD/MMS:$PYTHONPATH
export PYTHONPATH=$PWD/MMS/vits:$PYTHONPATH
pushd MMS/vits/monotonic_align
python3 setup.py build
ls -lh build/
ls -lh build/lib*/
ls -lh build/lib*/*/
cp build/lib*/vits/monotonic_align/core*.so .
sed -i.bak s/.monotonic_align.core/.core/g ./__init__.py
popd
Convert the model
Please save the following code into a file with name ./vits-mms.py
:
#!/usr/bin/env python3
import collections
import os
from typing import Any, Dict
import onnx
import torch
from vits import commons, utils
from vits.models import SynthesizerTrn
class OnnxModel(torch.nn.Module):
def __init__(self, model: SynthesizerTrn):
super().__init__()
self.model = model
def forward(
self,
x,
x_lengths,
noise_scale=0.667,
length_scale=1.0,
noise_scale_w=0.8,
):
return self.model.infer(
x=x,
x_lengths=x_lengths,
noise_scale=noise_scale,
length_scale=length_scale,
noise_scale_w=noise_scale_w,
)[0]
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
"""Add meta data to an ONNX model. It is changed in-place.
Args:
filename:
Filename of the ONNX model to be changed.
meta_data:
Key-value pairs.
"""
model = onnx.load(filename)
for key, value in meta_data.items():
meta = model.metadata_props.add()
meta.key = key
meta.value = str(value)
onnx.save(model, filename)
def load_vocab():
return [
x.replace("\n", "") for x in open("vocab.txt", encoding="utf-8").readlines()
]
@torch.no_grad()
def main():
hps = utils.get_hparams_from_file("config.json")
is_uroman = hps.data.training_files.split(".")[-1] == "uroman"
if is_uroman:
raise ValueError("We don't support uroman!")
symbols = load_vocab()
# Now generate tokens.txt
all_upper_tokens = [i.upper() for i in symbols]
duplicate = set(
[
item
for item, count in collections.Counter(all_upper_tokens).items()
if count > 1
]
)
print("generate tokens.txt")
with open("tokens.txt", "w", encoding="utf-8") as f:
for idx, token in enumerate(symbols):
f.write(f"{token} {idx}\n")
# both upper case and lower case correspond to the same ID
if (
token.lower() != token.upper()
and len(token.upper()) == 1
and token.upper() not in duplicate
):
f.write(f"{token.upper()} {idx}\n")
net_g = SynthesizerTrn(
len(symbols),
hps.data.filter_length // 2 + 1,
hps.train.segment_size // hps.data.hop_length,
**hps.model,
)
net_g.cpu()
_ = net_g.eval()
_ = utils.load_checkpoint("G_100000.pth", net_g, None)
model = OnnxModel(net_g)
x = torch.randint(low=1, high=10, size=(50,), dtype=torch.int64)
x = x.unsqueeze(0)
x_length = torch.tensor([x.shape[1]], dtype=torch.int64)
noise_scale = torch.tensor([1], dtype=torch.float32)
length_scale = torch.tensor([1], dtype=torch.float32)
noise_scale_w = torch.tensor([1], dtype=torch.float32)
opset_version = 13
filename = "model.onnx"
torch.onnx.export(
model,
(x, x_length, noise_scale, length_scale, noise_scale_w),
filename,
opset_version=opset_version,
input_names=[
"x",
"x_length",
"noise_scale",
"length_scale",
"noise_scale_w",
],
output_names=["y"],
dynamic_axes={
"x": {0: "N", 1: "L"}, # n_audio is also known as batch_size
"x_length": {0: "N"},
"y": {0: "N", 2: "L"},
},
)
meta_data = {
"model_type": "vits",
"comment": "mms",
"url": "https://huggingface.co/facebook/mms-tts/tree/main",
"add_blank": int(hps.data.add_blank),
"language": os.environ.get("language", "unknown"),
"frontend": "characters",
"n_speakers": int(hps.data.n_speakers),
"sample_rate": hps.data.sampling_rate,
}
print("meta_data", meta_data)
add_meta_data(filename=filename, meta_data=meta_data)
main()
The you can run it with:
export PYTHONPATH=$PWD/MMS:$PYTHONPATH
export PYTHONPATH=$PWD/MMS/vits:$PYTHONPATH
export lang=eng
python3 ./vits-mms.py
It will generate the following two files:
model.onnx
tokens.txt
Use the converted model
We can use the converted model with the following command after installing sherpa-onnx.
./build/bin/sherpa-onnx-offline-tts \
--vits-model=./model.onnx \
--vits-tokens=./tokens.txt \
--debug=1 \
--output-filename=./mms-eng.wav \
"How are you doing today? This is a text-to-speech application using models from facebook with next generation Kaldi"
The above command should generate a wave file mms-eng.wav
.
Wave filename | Content | Text |
---|---|---|
mms-eng.wav | How are you doing today? This is a text-to-speech application using models from facebook with next generation Kaldi |
Congratulations! You have successfully converted a model from MMS and run it with sherpa-onnx.
We are using eng
in this section as an example, you can replace it with other languages, such as
deu
for German, fra
for French, etc.