VITS

1 Overview

1.1 Background Introduction

VITS is an end-to-end speech synthesis algorithm that converts text directly into speech within a single model, without requiring additional intermediate models or hand-crafted feature extraction. For more details, refer to the official VITS repository:

https://github.com/jaywalnut310/vits

Since the officially provided VITS model is trained on English, we found another VITS repository trained on Chinese:

https://github.com/ywwwf/vits-mandarin-windows

The model download link is:

https://pan.baidu.com/s/1pN-wL_5wB9gYMAr2Mh7Jvg, password: vits

1.2 Usage Instructions

The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:

  • Board-side example program path: Linux_SDK/sdk/verify/opendla/source/tts/vits
  • Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/tts/vits.img

If the user does not need to convert the model, they can jump directly to section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Setting up the Python environment

      $conda create -n vits python==3.10
      $conda activate vits
      $git clone https://github.com/ywwwf/vits-mandarin-windows.git
      $cd vits-mandarin-windows
      $pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
      $cd monotonic_align
      $python setup.py build_ext --inplace
      $cd ..

    Note: The provided Python environment setup is only a reference example; for the specific setup process, please refer to the official source running tutorial:

    https://github.com/ywwwf/vits-mandarin-windows/blob/master/README.md
    
  • Model Testing

    • Create a models folder and place the downloaded model in this directory, then run the model testing script to ensure the VITS environment is configured correctly:

          $python inference.py
  • Model Modification

    • Modify the original code to replace dynamic-length (data-dependent) inputs and operations with fixed-length equivalents, so that the graph can be exported with static shapes

      • In the transforms.py file, from line 55 to line 94

        • Original code

          def unconstrained_rational_quadratic_spline(inputs,
                                      unnormalized_widths,
                                      unnormalized_heights,
                                      unnormalized_derivatives,
                                      inverse=False,
                                      tails='linear',
                                      tail_bound=1.,
                                      min_bin_width=DEFAULT_MIN_BIN_WIDTH,
                                      min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
                                      min_derivative=DEFAULT_MIN_DERIVATIVE):
              inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
              outside_interval_mask = ~inside_interval_mask
          
              outputs = torch.zeros_like(inputs)
              logabsdet = torch.zeros_like(inputs)
          
              if tails == 'linear':
                  unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
                  constant = np.log(np.exp(1 - min_derivative) - 1)
                  unnormalized_derivatives[..., 0] = constant
                  unnormalized_derivatives[..., -1] = constant
          
                  outputs[outside_interval_mask] = inputs[outside_interval_mask]
                  logabsdet[outside_interval_mask] = 0
              else:
                  raise RuntimeError('{} tails are not implemented.'.format(tails))
          
              outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline(
              inputs=inputs[inside_interval_mask],
              unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
              unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
              unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
              inverse=inverse,
              left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
              min_bin_width=min_bin_width,
              min_bin_height=min_bin_height,
              min_derivative=min_derivative
              )
          
              return outputs, logabsdet
          
        • Modified code

          def unconstrained_rational_quadratic_spline(inputs,
                                      unnormalized_widths,
                                      unnormalized_heights,
                                      unnormalized_derivatives,
                                      inverse=False,
                                      tails='linear',
                                      tail_bound=1.,
                                      min_bin_width=DEFAULT_MIN_BIN_WIDTH,
                                      min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
                                      min_derivative=DEFAULT_MIN_DERIVATIVE):
                  inside_interval_mask = (~(inputs < -tail_bound)) & (~(inputs > tail_bound))
                  outside_interval_mask = ~inside_interval_mask
          
                  outputs = torch.zeros_like(inputs)
                  logabsdet = torch.zeros_like(inputs)
          
                  if tails == 'linear':
                      unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
                      constant = np.log(np.exp(1 - min_derivative) - 1)
                      unnormalized_derivatives[..., 0] = constant
                      unnormalized_derivatives[..., -1] = constant
          
                      outputs = inputs * outside_interval_mask
                  else:
                      raise RuntimeError('{} tails are not implemented.'.format(tails))
          
                  outputs, logabsdet = rational_quadratic_spline(
                      inputs=inputs,
                      unnormalized_widths=unnormalized_widths,
                      unnormalized_heights=unnormalized_heights,
                      unnormalized_derivatives=unnormalized_derivatives,
                      inverse=inverse,
                      left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
                      min_bin_width=min_bin_width,
                      min_bin_height=min_bin_height,
                      min_derivative=min_derivative
                  )
                  outputs = outputs * inside_interval_mask
                  logabsdet = logabsdet * inside_interval_mask
          
                  return outputs, logabsdet
          
      • models.py

        • Line 50:
          • Original code def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
          • Modified code def forward(self, x, randn, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
        • Line 90
          • Original code z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale
          • Modified code z = randn.to(device=x.device, dtype=x.dtype) * noise_scale
        • Lines 167 to 176

          • Original code

            def forward(self, x, x_lengths):
                x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h]
                x = torch.transpose(x, 1, -1) # [b, h, t]
                x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)

                x = self.encoder(x * x_mask, x_mask)
                stats = self.proj(x) * x_mask

                m, logs = torch.split(stats, self.out_channels, dim=1)
                return x, m, logs, x_mask

          • Modified code

            def forward(self, x, x_mask):
                x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h]
                x = torch.transpose(x, 1, -1) # [b, h, t]
                # x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
            
                x = self.encoder(x * x_mask, x_mask)
                stats = self.proj(x) * x_mask
            
                m, logs = torch.split(stats, self.out_channels, dim=1)
                return x, m, logs, x_mask
            
        • Line 240

          • Original code z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
          • Modified code z = (m + torch.exp(logs)) * x_mask
        • Lines 499 to 523

          • Original code

            def infer(self, x, x_lengths, sid=None, noise_scale=1, length_scale=1, noise_scale_w=1., max_len=None):
                x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths)
                if self.n_speakers > 0:
                    g = self.emb_g(sid).unsqueeze(-1) # [b, h, 1]
                else:
                    g = None

                if self.use_sdp:
                    logw = self.dp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w)
                else:
                    logw = self.dp(x, x_mask, g=g)
                w = torch.exp(logw) * x_mask * length_scale
                w_ceil = torch.ceil(w)
                y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
                y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
                attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
                attn = commons.generate_path(w_ceil, attn_mask)

                m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
                logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']

                z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
                z = self.flow(z_p, y_mask, g=g, reverse=True)
                o = self.dec((z * y_mask)[:,:,:max_len], g=g)
                return o, attn, y_mask, (z, z_p, m_p, logs_p)

          • Modified code

            def infer(self, x, x_mask, sid=None, noise_scale=1, z_fixed=None, max_len=None):
                max_length = 1000
                x, m_p, logs_p, x_mask = self.enc_p(x, x_mask)
                if self.n_speakers > 0:
                    g = self.emb_g(sid).unsqueeze(-1) # [b, h, 1]
                else:
                    g = None
                randn = z_fixed[:,:2,:500]
                if self.use_sdp:
                    logw = self.dp(x, randn, x_mask, g=g, reverse=True, noise_scale=self.noise_scale_w)
                else:
                    logw = self.dp(x, randn, x_mask, g=g)
                w = torch.exp(logw) * x_mask * self.length_scale

                # w_ceil = torch.ceil(w)
                w_ceil = -torch.floor(-w)

                y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
                # y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
                y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, max_length), 1).to(x_mask.dtype)
                attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
                attn = commons.generate_path(w_ceil, attn_mask)

                m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
                logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']

                z_p = m_p + z_fixed * torch.exp(logs_p) * noise_scale[0]
                z = self.flow(z_p, y_mask, g=g, reverse=True)
                o = self.dec((z * y_mask)[:,:,:max_len], g=g)
                # return o, attn, y_mask, (z, z_p, m_p, logs_p)
                return o, y_lengths
            
      • attention.py Lines 165 to 170

        • Original code

          if mask is not None:
              scores = scores.masked_fill(mask == 0, -1e4)
              if self.block_length is not None:
                  assert t_s == t_t, "Local attention is only available for self-attention."
                  block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
                  scores = scores.masked_fill(block_mask == 0, -1e4)
          
        • Modified code

          if mask is not None:
              scores = scores.masked_fill(mask == 0, -1e4)
              scores = scores - (1-mask) * 104
              if self.block_length is not None:
                  assert t_s == t_t, "Local attention is only available for self-attention."
                  block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
                  # scores = scores.masked_fill(block_mask == 0, -103)
                  scores = scores - (1-block_mask) * 104
          
  • Model Export

    • Write the model conversion script export_onnx.py:
      import os
      import time
      import random
      import numpy as np
      import torch
      from scipy.io.wavfile import write
      import soundfile as sf
      
      import commons
      import utils
      from models import SynthesizerTrn
      from text import create_symbols_manager, text_to_sequence, cleaned_text_to_sequence, _clean_text
      import argparse
      
      import onnx
      from onnxsim import simplify
      import onnxruntime
      from thop import profile
      from torchsummary import summary
      
      class AudioGenerator():
          def __init__(self, hparams, device):
              self.hparams = hparams
              self._device = device
      
              if 'language' in hparams.data:
                  symbols_manager = create_symbols_manager(hparams.data.language)
              else:
                  symbols_manager = create_symbols_manager('default')
              self.symbol_to_id = symbols_manager._symbol_to_id
      
              self.net_g = create_network(hparams, symbols_manager.symbols, device)
      
          def load(self, path):
              load_checkpoint(self.net_g, path)
      
          def inference(self, text, phoneme_mode=False):
              return do_inference(self.net_g, self.hparams, self.symbol_to_id, text, phoneme_mode, self._device)
      
      def get_text(text, hparams, symbol_to_id, phoneme_mode=False):
          if not phoneme_mode:
              print("1: ", _clean_text(text, hparams.data.text_cleaners))
              text_norm = text_to_sequence(text, hparams.data.text_cleaners, symbol_to_id)
          else:
              print("2: ", text)
              text_norm = cleaned_text_to_sequence(text, symbol_to_id)
      
          if hparams.data.add_blank:
              text_norm = commons.intersperse(text_norm, 0)
          text_norm = torch.LongTensor(text_norm)
          return text_norm
      
      def create_network(hparams, symbols, device):
          net_g = SynthesizerTrn(
              len(symbols),
              hparams.data.filter_length // 2 + 1,
              hparams.train.segment_size // hparams.data.hop_length,
              **hparams.model).to(device)
          _ = net_g.eval()
      
          return net_g
      
      def load_checkpoint(network, path):
          _ = utils.load_checkpoint(path, network, None)
      
      # Assume the network has loaded weights and is ready to do inference
      def do_inference(generator, hparams, symbol_to_id, text, phoneme_mode=False, device=torch.device('cpu')):
          stn_tst = get_text(text, hparams, symbol_to_id, phoneme_mode)
      
          with torch.no_grad():
              x_tst = stn_tst.to(device).unsqueeze(0).int()
              x_tst = torch.cat([x_tst, torch.zeros(1, 500 - x_tst.size(1))],dim=1).int()
              x_tst_lengths = torch.tensor([stn_tst.size(0)])# 1x500
              x_mask = torch.unsqueeze(commons.sequence_mask(x_tst_lengths, x_tst.size(1)), 1).float()
              # noise_scale = 0.667
              # noise_scale_w = 0.8
              noise_scale = torch.tensor([random.uniform(0, 1)])
              noise_scale_w = torch.tensor([random.uniform(0, 1)])
              print(f"The noise ncale is {noise_scale}")
              print(f"The noise scale_w is {noise_scale_w}")
              max_length=1000
              input_tensor = torch.randn(1, hparams.model.hidden_channels, max_length, device=x_tst.device, dtype=torch.float32)
              z_fixed = torch.randn_like(input_tensor).to(x_tst.device)
              #audio_pt = generator.infer(x_tst.int(), x_mask, None, noise_scale_w, z_fixed)
              dummy_input = (x_tst.int(), x_mask, None, noise_scale_w, z_fixed)
      
              generator.forward = generator.infer
              torch.onnx.export(
                      model=generator,
                      args=dummy_input,
                      f='./models/vits.onnx',
                      input_names=["input", "mask", "noise_scale", "z_fixed"],
                      output_names=["z", "y_lengths"],
                      opset_version=13,
                      export_params=True,
                      verbose=False
                  )
      
              model = onnx.load('./models/vits.onnx')
              model_simp, check = simplify(model)
              export_name = './models/vits_sim.onnx'
              onnx.save(model_simp, export_name)
              # exit(1)
              # import pdb
              # pdb.set_trace()
              np.save("./npy_data/x_tst_0.npy", x_tst)
              np.save("./npy_data/x_mask_0.npy", x_mask)
              np.save("./npy_data/noise_scale_w_0.npy", noise_scale_w)
              np.save("./npy_data/z_fixed_0.npy", z_fixed)
      
              onnx_session_static = onnxruntime.InferenceSession('./models/vits_sim.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
              input_names_static = [input.name for input in onnx_session_static.get_inputs()]
              output_names_static = [output.name for output in onnx_session_static.get_outputs()]
              data_static = {input_names_static[0]: np.array(x_tst, dtype=np.int32),
                              input_names_static[1]: np.array(x_mask, dtype=np.float32),
                              input_names_static[2]: np.array(noise_scale_w, dtype=np.float32),
                              input_names_static[3]: np.array(z_fixed, dtype=np.float32),}
              data_out_static = onnx_session_static.run(output_names_static, data_static)
              audio=data_out_static[0][0][0]
              y_length = data_out_static[1]
      
          return audio,y_length
      
      def save_to_wav(data, sampling_rate, path):
          sf.write(path, data, sampling_rate, 'PCM_16')
      
      if __name__ == "__main__":
          config_path = "./config/bb_v100.json"
          hps = utils.get_hparams_from_file(config_path)
          audio_generator = AudioGenerator(hps, "cpu")
          checkpoint_path = "./models/G_bb_v100_820000.pth"
          audio_generator.load(checkpoint_path)
          phoneme_mode = False
          do_noise_reduction = True
      
          text = "他的到来是一件好事, 我很欢迎他,  大家好, 我是御坂美琴的儿子。"
          start = time.perf_counter()
          audio,y_length = audio_generator.inference(text, phoneme_mode)
      
          print(f"The inference takes {time.perf_counter() - start} seconds")
      
          print(audio.dtype)
      
          if do_noise_reduction:
              import noisereduce as nr
      
              # perform noise reduction
              audio = nr.reduce_noise(y=audio, sr=hps.data.sampling_rate)
      
          output_dir = './output/'
          # python program to check if a path exists
          # if it doesn’t exist we create one
          if not os.path.exists(output_dir):
              os.makedirs(output_dir)
          filename = 'output.wav'
          file_path = os.path.join(output_dir, filename)
      
          save_to_wav(audio, hps.data.sampling_rate, file_path)
      
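    • With the modifications above in place, run the export script from the vits-mandarin-windows directory:

          $python export_onnx.py

      Assuming the checkpoint ./models/G_bb_v100_820000.pth, the config ./config/bb_v100.json, and the noisereduce package are available (paths as referenced in the script's __main__), this exports ./models/vits.onnx, simplifies it to ./models/vits_sim.onnx, dumps the four input tensors to ./npy_data/ for later accuracy comparison, and writes a test WAV to ./output/output.wav.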

2.2 Offline Model Conversion

2.2.1 Preprocessing & Postprocessing Instructions

  • Preprocessing

    The input information for the VITS model is shown in the figure below; the model has four inputs. Among them, input is the token_id sequence, mask records the positions of the valid token_ids, and noise_scale and z_fixed are random values used to enrich the diversity of the model's output. Taking Chinese as an example, converting a segment of input text into the model's input requires the following steps (a minimal sketch follows this list):

    1) Use the pypinyin package to convert the Chinese text to Pinyin.
    2) Split the Pinyin, look up the index value corresponding to each phoneme in a dictionary, and convert the phonemes to tokens.
    3) Pad each token with 0 before and after it to indicate pauses, yielding the final token_id sequence.

  • Postprocessing

    VITS requires no postprocessing; the output tensor is converted directly to audio with a WAV utility. The model output information is as follows:
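
    Below is a minimal sketch of the text-to-token preprocessing described above, assuming pypinyin is installed. The symbol table used here is a hypothetical toy mapping; the real mapping is the phoneme dictionary shipped with the model's text frontend.

      from pypinyin import lazy_pinyin, Style

      # Hypothetical toy symbol table; the real dictionary comes from the
      # model's text frontend and is much larger.
      TOY_SYMBOL_TO_ID = {"ni3": 1, "hao3": 2}

      def text_to_token_ids(text, symbol_to_id=TOY_SYMBOL_TO_ID):
          # 1) Chinese characters -> Pinyin with tone numbers, e.g. "你好" -> ["ni3", "hao3"]
          pinyins = lazy_pinyin(text, style=Style.TONE3)
          # 2) look up each Pinyin unit in the dictionary to get its token
          tokens = [symbol_to_id[p] for p in pinyins if p in symbol_to_id]
          # 3) pad a 0 before and after every token to mark pauses
          token_ids = [0]
          for t in tokens:
              token_ids.extend([t, 0])
          return token_ids

      print(text_to_token_ids("你好"))   # -> [0, 1, 0, 2, 0]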

2.2.2 Offline Model Conversion Process

Note: 1) OpenDLAModel corresponds to the smodel file extracted from the compressed package image-dev_model_convert.tar. 2) The conversion command needs to be run in a Docker environment; please load the SGS Docker environment according to the Docker development environment tutorial.

  • Copy the ONNX model to the conversion code directory $cp models/vits_sim.onnx OpenDLAModel/tts/vits/onnx
  • Conversion command

      $cd IPU_SDK_Release/docker
      $bash run_docker.sh
      # Enter the OpenDLAModel directory in the Docker environment
      $cd /work/SGS_XXX/OpenDLAModel
      $bash convert.sh -a tts/vits -c config/tts_vits.cfg -p SGS_IPU_Toolchain (absolute path) -s false
  • Final generated model locations

      output/{chip}_/vits.img
      output/{chip}_/vits_fixed.sim
      output/{chip}_/vits_float.sim

2.2.3 Key Script Parameter Analysis

-   input_config.ini
        [INPUT_CONFIG]
        inputs=speech,mask,noise_scale,z_fixed;                                             # ONNX input node names, separated by commas if there are multiple;
        input_formats=RAWDATA_S16_NHWC,RAWDATA_F32_NHWC,RAWDATA_F32_NHWC,RAWDATA_F32_NHWC;  # Board-side input formats, can be selected according to ONNX input format, e.g., float: RAWDATA_F32_NHWC, int32: RAWDATA_S16_NHWC;
        quantizations=TRUE,TRUE,TRUE,TRUE;                                                  # Enable input quantization, do not modify;
        [OUTPUT_CONFIG]
        outputs=z,y_lengths;                                # ONNX output node names, separated by commas if there are multiple;
        dequantizations=TRUE,TRUE;                          # Whether to enable dequantization, fill according to actual needs, recommended to be TRUE. If set to False, output will be int16; if set to True, output will be float32.

-   tts_vits.cfg
        [VITS]
        CHIP_LIST=pcupid                    # Platform name, must match the board platform, otherwise the model cannot run
        Model_LIST=vits_sim                 # Input ONNX model name
        INPUT_SIZE_LIST=0                   # Model input resolution
        INPUT_INI_LIST=input_config.ini     # Configuration file
        CLASS_NUM_LIST=0                    # Just fill in 0
        SAVE_NAME_LIST=vits.img             # Output model name
        QUANT_DATA_PATH=image_lists.txt     # Quantization data path

2.3 Model Simulation

  • Obtain float/fixed/offline model outputs

      $bash convert.sh -a tts/vits -c config/tts_vits.cfg -p SGS_IPU_Toolchain (absolute path) -s true

    After executing the above command, the float model's output tensor is saved by default in a txt file under tts/vits/log/output. In addition, the tts/vits/convert.sh script provides simulation examples for the fixed and offline models; users can obtain the fixed and offline model outputs by uncommenting the corresponding code blocks before running the script.
  • Model Accuracy Comparison

    With the same input as used for the models above, enter the environment built in section 2.1 and print the result at line 351 of the vits-mandarin-windows/export_onnx.py file:

      print(audio)

    This gives the output tensor of the corresponding node in the PyTorch model, which can then be compared against the float, fixed, and offline model outputs. Note that the original model's output layout is NCHW, while the float/fixed/offline models output NHWC, so one of the two must be transposed before comparison (see the sketch below).
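
    A minimal sketch of such a comparison, assuming the PyTorch result has been saved as an .npy file in NCHW layout and the simulator txt output has been flattened to plain numbers; the file names and shapes here are hypothetical placeholders.

      import numpy as np

      # Hypothetical file names/shapes: adapt them to the actual dump locations.
      ref_nchw = np.load("pytorch_audio_nchw.npy")                    # e.g. (1, 1, 1, T)
      sim_nhwc = np.loadtxt("float_output.txt").reshape(1, 1, -1, 1)  # e.g. (1, 1, T, 1)

      # Transpose the PyTorch reference from NCHW to NHWC before comparing.
      ref_nhwc = ref_nchw.transpose(0, 2, 3, 1)
      print("max abs diff:", np.abs(ref_nhwc - sim_nhwc).max())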

3 Board-side Deployment

3.1 Program Compilation

Before compiling the example program, select the appropriate defconfig for the board (nand/nor/emmc, DDR model, etc.) and perform a complete SDK compilation. For details, refer to the alkaid SDK sigdoc document "Development Environment Setup."

  • Compile the board-side VITS example

      $cd sdk/verify/opendla
      $make clean && make source/tts/vits -j8

  • Final generated executable file location

      sdk/verify/opendla/out//app/prog_tts_vits

3.2 Running Files

When running the program, you need to copy the following files to the board:

  • prog_tts_vits
  • vits.img
  • zh_tn_tagger.fst
  • zh_tn_verbalizer.fst

3.3 Running Instructions

  • Usage: ./prog_tts_vits txt model zh_tn_tagger.fst zh_tn_verbalizer.fst (command to run the executable)

  • Required Input:

    • txt: path to the input text
    • model: path to the offline model to be tested
    • zh_tn_tagger.fst: third-party resource required for text preprocessing, used for Chinese normalization
    • zh_tn_verbalizer.fst: third-party resource required for text preprocessing, used for Chinese normalization
  • Example input text (input_word.txt):

      The United States is one of China's trade partners, and the current trade pattern between China and the United States is the result of supply and demand matching and market allocation.
      Art is the horn of the era's advancement, and artists are the engineers of the soul.
  • Typical Output:

      ./prog_tts_vits resource/input_word.txt models/vits.img resource/wetext/zh_tn_tagger.fst resource/wetext/zh_tn_verbalizer.fst
      input text: The United States is one of China's trade partners, and the current trade pattern between China and the United States is the result of supply and demand matching and market allocation.
      model invoke time: 2659.814000 ms
      Generated: 220416 samples of audio
      0.000000 -0.000093 0.000521 -0.000707 0.000558 -0.000391
      WAV file 'output_0.wav' has been written
      input text: Art is the horn of the era's advancement, and artists are the engineers of the soul.
      model invoke time: 2659.969000 ms
      Generated: 106496 samples of audio
      0.000000 -0.000093 0.000521 -0.000707 0.000521 -0.000391
      WAV file 'output_1.wav' has been written