VITS
1 Overview¶
1.1 Background Introduction¶
VITS is an end-to-end speech synthesis algorithm that converts text directly into speech in a single model, without requiring additional intermediate steps or separate feature extraction. For more details, refer to the official VITS repository:
https://github.com/jaywalnut310/vits
Since the officially provided VITS model is trained on English, we found another VITS repository trained on Chinese:
https://github.com/ywwwf/vits-mandarin-windows
The model download link is:
https://pan.baidu.com/s/1pN-wL_5wB9gYMAr2Mh7Jvg, password: vits
1.2 Usage Instructions¶
The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:
- Board-side example program path: Linux_SDK/sdk/verify/opendla/source/tts/vits
- Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/tts/vits.img
If the user does not need to convert the model, they can jump directly to section 3.
2 Model Conversion¶
2.1 ONNX Model Conversion¶
- Setting up the Python environment

  ```
  $ conda create -n vits python==3.10
  $ conda activate vits
  $ git clone https://github.com/ywwwf/vits-mandarin-windows.git
  $ cd vits-mandarin-windows
  $ pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
  $ cd monotonic_align
  $ python setup.py build_ext --inplace
  $ cd ..
  ```

  Note: the Python environment setup above is only a reference example; for the specific setup process, please refer to the official running tutorial:
  https://github.com/ywwwf/vits-mandarin-windows/blob/master/README.md
- Model Testing

  Create a `models` folder, place the downloaded model in it, and then run the model testing script to confirm that the VITS environment is configured correctly:

  ```
  $ python inference.py
  ```
- Model Modification

  Modify the original code to change the dynamic-length inputs into fixed-length inputs.

  - `transforms.py`, lines 55 to 94

    Original code:

    ```python
    def unconstrained_rational_quadratic_spline(inputs,
                                                unnormalized_widths,
                                                unnormalized_heights,
                                                unnormalized_derivatives,
                                                inverse=False,
                                                tails='linear',
                                                tail_bound=1.,
                                                min_bin_width=DEFAULT_MIN_BIN_WIDTH,
                                                min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
                                                min_derivative=DEFAULT_MIN_DERIVATIVE):
        inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
        outside_interval_mask = ~inside_interval_mask

        outputs = torch.zeros_like(inputs)
        logabsdet = torch.zeros_like(inputs)

        if tails == 'linear':
            unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
            constant = np.log(np.exp(1 - min_derivative) - 1)
            unnormalized_derivatives[..., 0] = constant
            unnormalized_derivatives[..., -1] = constant

            outputs[outside_interval_mask] = inputs[outside_interval_mask]
            logabsdet[outside_interval_mask] = 0
        else:
            raise RuntimeError('{} tails are not implemented.'.format(tails))

        outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline(
            inputs=inputs[inside_interval_mask],
            unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
            unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
            unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
            inverse=inverse,
            left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
            min_bin_width=min_bin_width,
            min_bin_height=min_bin_height,
            min_derivative=min_derivative
        )

        return outputs, logabsdet
    ```

    Modified code:

    ```python
    def unconstrained_rational_quadratic_spline(inputs,
                                                unnormalized_widths,
                                                unnormalized_heights,
                                                unnormalized_derivatives,
                                                inverse=False,
                                                tails='linear',
                                                tail_bound=1.,
                                                min_bin_width=DEFAULT_MIN_BIN_WIDTH,
                                                min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
                                                min_derivative=DEFAULT_MIN_DERIVATIVE):
        inside_interval_mask = (~(inputs < -tail_bound)) & (~(inputs > tail_bound))
        outside_interval_mask = ~inside_interval_mask

        outputs = torch.zeros_like(inputs)
        logabsdet = torch.zeros_like(inputs)

        if tails == 'linear':
            unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
            constant = np.log(np.exp(1 - min_derivative) - 1)
            unnormalized_derivatives[..., 0] = constant
            unnormalized_derivatives[..., -1] = constant

            outputs = inputs * outside_interval_mask
        else:
            raise RuntimeError('{} tails are not implemented.'.format(tails))

        outputs, logabsdet = rational_quadratic_spline(
            inputs=inputs,
            unnormalized_widths=unnormalized_widths,
            unnormalized_heights=unnormalized_heights,
            unnormalized_derivatives=unnormalized_derivatives,
            inverse=inverse,
            left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
            min_bin_width=min_bin_width,
            min_bin_height=min_bin_height,
            min_derivative=min_derivative
        )
        outputs = outputs * inside_interval_mask
        logabsdet = logabsdet * inside_interval_mask

        return outputs, logabsdet
    ```
  - `models.py`

    Line 50, original code:

    ```python
    def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
    ```

    Line 50, modified code:

    ```python
    def forward(self, x, randn, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
    ```

    Line 90, original code:

    ```python
    z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale
    ```

    Line 90, modified code:

    ```python
    z = randn.to(device=x.device, dtype=x.dtype) * noise_scale
    ```
    Lines 167 to 176, original code:

    ```python
    def forward(self, x, x_lengths):
        x = self.emb(x) * math.sqrt(self.hidden_channels)  # [b, t, h]
        x = torch.transpose(x, 1, -1)  # [b, h, t]
        x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)

        x = self.encoder(x * x_mask, x_mask)
        stats = self.proj(x) * x_mask

        m, logs = torch.split(stats, self.out_channels, dim=1)
        return x, m, logs, x_mask
    ```

    Lines 167 to 176, modified code:

    ```python
    def forward(self, x, x_mask):
        x = self.emb(x) * math.sqrt(self.hidden_channels)  # [b, t, h]
        x = torch.transpose(x, 1, -1)  # [b, h, t]
        # x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)

        x = self.encoder(x * x_mask, x_mask)
        stats = self.proj(x) * x_mask

        m, logs = torch.split(stats, self.out_channels, dim=1)
        return x, m, logs, x_mask
    ```
    Line 240, original code:

    ```python
    z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
    ```

    Line 240, modified code:

    ```python
    z = (m + torch.exp(logs)) * x_mask
    ```
    Lines 499 to 523, original code:

    ```python
    def infer(self, x, x_lengths, sid=None, noise_scale=1, length_scale=1, noise_scale_w=1., max_len=None):
        x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths)
        if self.n_speakers > 0:
            g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
        else:
            g = None

        if self.use_sdp:
            logw = self.dp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w)
        else:
            logw = self.dp(x, x_mask, g=g)
        w = torch.exp(logw) * x_mask * length_scale
        w_ceil = torch.ceil(w)
        y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
        y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
        attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
        attn = commons.generate_path(w_ceil, attn_mask)

        m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2)  # [b, t', t], [b, t, d] -> [b, d, t']
        logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2)  # [b, t', t], [b, t, d] -> [b, d, t']

        z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
        z = self.flow(z_p, y_mask, g=g, reverse=True)
        o = self.dec((z * y_mask)[:, :, :max_len], g=g)
        return o, attn, y_mask, (z, z_p, m_p, logs_p)
    ```

    Lines 499 to 523, modified code:

    ```python
    def infer(self, x, x_mask, sid=None, noise_scale=1, z_fixed=None, max_len=None):
        max_length = 1000
        x, m_p, logs_p, x_mask = self.enc_p(x, x_mask)
        if self.n_speakers > 0:
            g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
        else:
            g = None
        randn = z_fixed[:, :2, :500]
        if self.use_sdp:
            logw = self.dp(x, randn, x_mask, g=g, reverse=True, noise_scale=self.noise_scale_w)
        else:
            logw = self.dp(x, randn, x_mask, g=g)
        w = torch.exp(logw) * x_mask * self.length_scale
        # w_ceil = torch.ceil(w)
        w_ceil = -torch.floor(-w)
        y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
        # y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
        y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, max_length), 1).to(x_mask.dtype)
        attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
        attn = commons.generate_path(w_ceil, attn_mask)

        m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2)  # [b, t', t], [b, t, d] -> [b, d, t']
        logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2)  # [b, t', t], [b, t, d] -> [b, d, t']

        z_p = m_p + z_fixed * torch.exp(logs_p) * noise_scale[0]
        z = self.flow(z_p, y_mask, g=g, reverse=True)
        o = self.dec((z * y_mask)[:, :, :max_len], g=g)
        # return o, attn, y_mask, (z, z_p, m_p, logs_p)
        return o, y_lengths
    ```
  - `attentions.py`, lines 165 to 170

    Original code:

    ```python
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e4)
        if self.block_length is not None:
            assert t_s == t_t, "Local attention is only available for self-attention."
            block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
            scores = scores.masked_fill(block_mask == 0, -1e4)
    ```

    Modified code:

    ```python
    if mask is not None:
        # scores = scores.masked_fill(mask == 0, -1e4)
        scores = scores - (1 - mask) * 1e4
        if self.block_length is not None:
            assert t_s == t_t, "Local attention is only available for self-attention."
            block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
            # scores = scores.masked_fill(block_mask == 0, -1e4)
            scores = scores - (1 - block_mask) * 1e4
    ```
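  The rewrites above replace data-dependent indexing and `masked_fill` with plain mask arithmetic so that all tensor shapes stay fixed for export. The toy check below is an addition to this document (not code from the repository); it assumes only standard PyTorch and shows that subtracting a large constant at masked positions behaves the same as `masked_fill` once softmax is applied.

  ```python
  # Illustrative check: arithmetic masking vs. masked_fill.
  import torch

  scores = torch.randn(1, 1, 4, 4)
  mask = torch.tensor([1., 1., 1., 0.]).view(1, 1, 1, 4)

  a = torch.softmax(scores.masked_fill(mask == 0, -1e4), dim=-1)
  b = torch.softmax(scores - (1 - mask) * 1e4, dim=-1)

  print(torch.allclose(a, b, atol=1e-6))  # True: masked positions get ~0 attention weight either way
  ```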
- Model Export

  Write the model conversion script `export_onnx.py`:
  ```python
  import os
  import time
  import random

  import numpy as np
  import torch
  from scipy.io.wavfile import write
  import soundfile as sf

  import commons
  import utils
  from models import SynthesizerTrn
  from text import create_symbols_manager, text_to_sequence, cleaned_text_to_sequence, _clean_text

  import argparse
  import onnx
  from onnxsim import simplify
  import onnxruntime
  from thop import profile
  from torchsummary import summary


  class AudioGenerator():
      def __init__(self, hparams, device):
          self.hparams = hparams
          self._device = device

          if 'language' in hparams.data:
              symbols_manager = create_symbols_manager(hparams.data.language)
          else:
              symbols_manager = create_symbols_manager('default')
          self.symbol_to_id = symbols_manager._symbol_to_id

          self.net_g = create_network(hparams, symbols_manager.symbols, device)

      def load(self, path):
          load_checkpoint(self.net_g, path)

      def inference(self, text, phoneme_mode=False):
          return do_inference(self.net_g, self.hparams, self.symbol_to_id, text, phoneme_mode, self._device)


  def get_text(text, hparams, symbol_to_id, phoneme_mode=False):
      if not phoneme_mode:
          print("1: ", _clean_text(text, hparams.data.text_cleaners))
          text_norm = text_to_sequence(text, hparams.data.text_cleaners, symbol_to_id)
      else:
          print("2: ", text)
          text_norm = cleaned_text_to_sequence(text, symbol_to_id)
      if hparams.data.add_blank:
          text_norm = commons.intersperse(text_norm, 0)
      text_norm = torch.LongTensor(text_norm)
      return text_norm


  def create_network(hparams, symbols, device):
      net_g = SynthesizerTrn(
          len(symbols),
          hparams.data.filter_length // 2 + 1,
          hparams.train.segment_size // hparams.data.hop_length,
          **hparams.model).to(device)
      _ = net_g.eval()
      return net_g


  def load_checkpoint(network, path):
      _ = utils.load_checkpoint(path, network, None)


  # Assume the network has loaded weights and is ready to do inference
  def do_inference(generator, hparams, symbol_to_id, text, phoneme_mode=False, device=torch.device('cpu')):
      stn_tst = get_text(text, hparams, symbol_to_id, phoneme_mode)
      with torch.no_grad():
          x_tst = stn_tst.to(device).unsqueeze(0).int()
          x_tst = torch.cat([x_tst, torch.zeros(1, 500 - x_tst.size(1))], dim=1).int()
          x_tst_lengths = torch.tensor([stn_tst.size(0)])  # 1x500
          x_mask = torch.unsqueeze(commons.sequence_mask(x_tst_lengths, x_tst.size(1)), 1).float()
          # noise_scale = 0.667
          # noise_scale_w = 0.8
          noise_scale = torch.tensor([random.uniform(0, 1)])
          noise_scale_w = torch.tensor([random.uniform(0, 1)])
          print(f"The noise scale is {noise_scale}")
          print(f"The noise scale_w is {noise_scale_w}")

          max_length = 1000
          input_tensor = torch.randn(1, hparams.model.hidden_channels, max_length, device="cuda", dtype=torch.float32)
          z_fixed = torch.randn_like(input_tensor).to(x_tst.device)

          # audio_pt = generator.infer(x_tst.int(), x_mask, None, noise_scale_w, z_fixed)
          dummy_input = (x_tst.int(), x_mask, None, noise_scale_w, z_fixed)
          generator.forward = generator.infer
          torch.onnx.export(
              model=generator,
              args=dummy_input,
              f='./models/vits.onnx',
              input_names=["input", "mask", "noise_scale", "z_fixed"],
              output_names=["z", "y_lengths"],
              opset_version=13,
              export_params=True,
              verbose=False
          )

          model = onnx.load('./models/vits.onnx')
          model_simp, check = simplify(model)
          export_name = './models/vits_sim.onnx'
          onnx.save(model_simp, export_name)
          # exit(1)
          # import pdb
          # pdb.set_trace()

          np.save("./npy_data/x_tst_0.npy", x_tst)
          np.save("./npy_data/x_mask_0.npy", x_mask)
          np.save("./npy_data/noise_scale_w_0.npy", noise_scale_w)
          np.save("./npy_data/z_fixed_0.npy", z_fixed)

          onnx_session_static = onnxruntime.InferenceSession(
              './models/vits_sim.onnx',
              providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
          input_names_static = [input.name for input in onnx_session_static.get_inputs()]
          output_names_static = [output.name for output in onnx_session_static.get_outputs()]
          data_static = {input_names_static[0]: np.array(x_tst, dtype=np.int32),
                         input_names_static[1]: np.array(x_mask, dtype=np.float32),
                         input_names_static[2]: np.array(noise_scale_w, dtype=np.float32),
                         input_names_static[3]: np.array(z_fixed, dtype=np.float32)}
          data_out_static = onnx_session_static.run(output_names_static, data_static)
          audio = data_out_static[0][0][0]
          y_length = data_out_static[1]

      return audio, y_length


  def save_to_wav(data, sampling_rate, path):
      sf.write(path, data, 22050, 'PCM_16')


  if __name__ == "__main__":
      config_path = "./config/bb_v100.json"
      hps = utils.get_hparams_from_file(config_path)

      audio_generator = AudioGenerator(hps, "cpu")
      checkpoint_path = "./models/G_bb_v100_820000.pth"
      audio_generator.load(checkpoint_path)

      phoneme_mode = False
      do_noise_reduction = True
      text = "他的到来是一件好事, 我很欢迎他, 大家好, 我是御坂美琴的儿子。"

      start = time.perf_counter()
      audio, y_length = audio_generator.inference(text, phoneme_mode)
      print(f"The inference takes {time.perf_counter() - start} seconds")
      print(audio.dtype)

      if do_noise_reduction:
          import noisereduce as nr
          # perform noise reduction
          audio = nr.reduce_noise(y=audio, sr=hps.data.sampling_rate)

      output_dir = './output/'
      # check whether the output path exists; if it doesn't, create it
      if not os.path.exists(output_dir):
          os.makedirs(output_dir)
      filename = 'output.wav'
      file_path = os.path.join(output_dir, filename)
      save_to_wav(audio, hps.data.sampling_rate, file_path)
  ```
2.2 Offline Model Conversion¶
2.2.1 Preprocessing & Postprocessing Instructions¶
- Preprocessing

  The input information for the VITS model is shown in the figure below; the model has four inputs. Among them, `input` is the input token_id sequence, `mask` records the positions of valid token_ids, and `noise_scale` and `z_fixed` are random values used to enrich the diversity of the model's output. Taking Chinese as an example, converting a segment of input text into the model's input requires the following steps (a minimal sketch is given after this list):

  1) Use the pypinyin package to convert the Chinese text to Pinyin.
  2) Split the Pinyin, look up the index value of each phoneme in a dictionary, and convert the phonemes to tokens.
  3) Pad each token with 0 before and after to indicate pauses, and finally convert the sequence to token_ids.

- Postprocessing
VITS requires no postprocessing; the output tensor is converted directly to audio and written out as a WAV file. The model output information is as follows:
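As a reference, the sketch below walks through the preprocessing steps. It is an illustration added to this document, not code from the repository: the `symbol_to_id` table is a made-up placeholder (the real mapping comes from the repository's text frontend, e.g. `text_to_sequence` and `commons.intersperse`), and the fixed length of 500 follows the export script in section 2.1.

```python
# Hedged preprocessing sketch: Chinese text -> pinyin -> token_ids + mask.
from pypinyin import lazy_pinyin, Style

symbol_to_id = {"ni3": 1, "hao3": 2}   # hypothetical phoneme dictionary

def text_to_model_input(text, fixed_len=500):
    # 1) Chinese characters -> pinyin with tone numbers, e.g. ['ni3', 'hao3']
    phonemes = lazy_pinyin(text, style=Style.TONE3)
    # 2) phonemes -> tokens via dictionary lookup (unknown phonemes are skipped here)
    ids = [symbol_to_id[p] for p in phonemes if p in symbol_to_id]
    # 3) pad each token with 0 before and after to mark pauses
    tokens = [0]
    for i in ids:
        tokens += [i, 0]
    valid_len = len(tokens)
    # pad to the fixed input length of the converted model and build the mask of valid positions
    token_ids = tokens + [0] * (fixed_len - valid_len)
    mask = [1.0] * valid_len + [0.0] * (fixed_len - valid_len)
    return token_ids, mask

token_ids, mask = text_to_model_input("你好")
```

On the output side, the synthesized float samples can simply be written to a 22050 Hz PCM-16 WAV file, as the export script in section 2.1 does with `sf.write(path, data, 22050, 'PCM_16')`.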
2.2.2 Offline Model Conversion Process¶
Note: 1) OpenDLAModel corresponds to the smodel file extracted from the compressed package image-dev_model_convert.tar. 2) The conversion command needs to be run in a Docker environment; please load the SGS Docker environment according to the Docker development environment tutorial.
- Copy the ONNX model to the conversion code directory

  ```
  $ cp models/vits_sim.onnx OpenDLAModel/tts/vits/onnx
  ```
- Conversion command

  ```
  $ cd IPU_SDK_Release/docker
  $ bash run_docker.sh
  # Enter the OpenDLAModel directory inside the Docker environment
  $ cd /work/SGS_XXX/OpenDLAModel
  $ bash convert.sh -a tts/vits -c config/tts_vits.cfg -p SGS_IPU_Toolchain (absolute path) -s false
  ```
- Final generated model locations

  ```
  output/{chip}_/vits.img
  output/{chip}_/vits_fixed.sim
  output/{chip}_/vits_float.sim
  ```
2.2.3 Key Script Parameter Analysis¶
- input_config.ini

  ```ini
  [INPUT_CONFIG]
  inputs=speech,mask,noise_scale,z_fixed;   # ONNX input node names, separated by commas if there are multiple;
  input_formats=RAWDATA_S16_NHWC,RAWDATA_F32_NHWC,RAWDATA_F32_NHWC,RAWDATA_F32_NHWC;   # Board-side input formats, selected according to the ONNX input format, e.g., float: RAWDATA_F32_NHWC, int32: RAWDATA_S16_NHWC;
  quantizations=TRUE,TRUE,TRUE,TRUE;        # Enable input quantization, do not modify;

  [OUTPUT_CONFIG]
  outputs=z,y_lengths;          # ONNX output node names, separated by commas if there are multiple;
  dequantizations=TRUE,TRUE;    # Whether to enable dequantization, fill in according to actual needs, recommended TRUE. If set to FALSE, the output is int16; if set to TRUE, the output is float32.
  ```
- tts_vits.cfg

  ```ini
  [VITS]
  CHIP_LIST=pcupid                  # Platform name, must match the board platform, otherwise the model cannot run
  Model_LIST=vits_sim               # Input ONNX model name
  INPUT_SIZE_LIST=0                 # Model input resolution
  INPUT_INI_LIST=input_config.ini   # Configuration file
  CLASS_NUM_LIST=0                  # Just fill in 0
  SAVE_NAME_LIST=vits.img           # Output model name
  QUANT_DATA_PATH=image_lists.txt   # Quantization data path
  ```
2.3 Model Simulation¶
- Obtain float/fixed/offline model outputs

  ```
  $ bash convert.sh -a tts/vits -c config/tts_vits.cfg -p SGS_IPU_Toolchain (absolute path) -s true
  ```

  After executing the above command, the float model's output tensor is saved by default as a txt file under `tts/vits/log/output`. In addition, the `tts/vits/convert.sh` script also provides simulation examples for the fixed and offline models; users can obtain the fixed and offline outputs by uncommenting the corresponding code blocks before running it.

- Model Accuracy Comparison

  With the same input as the models above, enter the environment built in section 2.1 and print the result at line 351 of the `vits-mandarin-windows/export_onnx.py` file:

  ```python
  print(audio)
  ```

  This yields the output tensor of the corresponding node in the PyTorch model, which can then be compared against the float, fixed, and offline model outputs (a hedged comparison sketch is given below). Note that the original model's output layout is `NCHW`, while the float/fixed/offline models output `NHWC`.
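The hedged sketch below shows one way to compare the dumps. The file names and the reshape are placeholders; they depend on the tensors you actually saved and on the dump generated under `tts/vits/log/output`.

```python
# Hedged comparison sketch: PyTorch output vs. float/fixed/offline dump.
import numpy as np

pt_out = np.load("pytorch_audio.npy")                          # e.g. np.save()'d from the print(audio) tensor
sim_out = np.loadtxt("tts/vits/log/output/output_tensor.txt")  # simulator dump (file name assumed)

# For 4-D tensors, convert the simulator's NHWC layout back to NCHW first, e.g.:
# sim_out = sim_out.reshape(n, h, w, c).transpose(0, 3, 1, 2)

pt = pt_out.ravel().astype(np.float64)
sim = sim_out.ravel().astype(np.float64)[:pt.size]

cos = np.dot(pt, sim) / (np.linalg.norm(pt) * np.linalg.norm(sim) + 1e-12)
print("cosine similarity:", cos)
print("max abs diff:", np.max(np.abs(pt - sim)))
```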
3 Board-side Deployment¶
3.1 Program Compilation¶
Before compiling the example program, select the appropriate deconfig for the board (nand/nor/emmc, DDR model, etc.) and complete a full SDK compilation. For details, refer to the alkaid SDK sigdoc document "Development Environment Setup".

- Compile the board-side VITS example

  ```
  $ cd sdk/verify/opendla
  $ make clean && make source/tts/vits -j8
  ```

- Final generated executable file location

  ```
  sdk/verify/opendla/out//app/prog_tts_vits
  ```
3.2 Running Files¶
When running the program, you need to copy the following files to the board:
- prog_tts_vits
- vits.img
- zh_tn_tagger.fst
- zh_tn_verbalizer.fst
3.3 Running Instructions¶
- Usage (command to run the executable):

  ```
  ./prog_tts_vits txt model zh_tn_tagger.fst zh_tn_verbalizer.fst
  ```

- Required Inputs:
  - txt: path to the input text
  - model: path to the offline model to be tested
  - zh_tn_tagger.fst: third-party resource required for text preprocessing, used for Chinese text normalization
  - zh_tn_verbalizer.fst: third-party resource required for text preprocessing, used for Chinese text normalization
- Example input text: input_word.txt

  ```
  The United States is one of China's trade partners, and the current trade pattern between China and the United States is the result of supply and demand matching and market allocation.
  Art is the horn of the era's advancement, and artists are the engineers of the soul.
  ```
- Typical Output:

  ```
  ./prog_tts_vits resource/input_word.txt models/vits.img resource/wetext/zh_tn_tagger.fst resource/wetext/zh_tn_verbalizer.fst
  input text: The United States is one of China's trade partners, and the current trade pattern between China and the United States is the result of supply and demand matching and market allocation.
  model invoke time: 2659.814000 ms
  Generated: 220416 samples of audio
  0.000000 -0.000093 0.000521 -0.000707 0.000558 -0.000391
  WAV file 'output_0.wav' has been written
  input text: Art is the horn of the era's advancement, and artists are the engineers of the soul.
  model invoke time: 2659.969000 ms
  Generated: 106496 samples of audio
  0.000000 -0.000093 0.000521 -0.000707 0.000521 -0.000391
  WAV file 'output_1.wav' has been written
  ```