
SepFormer

1 Overview

1.1 Background Introduction

Speech separation refers to extracting the individual speech sources from a mixed audio signal, for example separating each speaker's voice from a multi-person conference recording. The speech separation algorithm used here comes from SpeechBrain, an open-source PyTorch-based toolkit that integrates a variety of advanced audio processing techniques. For more details, please refer to the official SpeechBrain repository:

https://github.com/speechbrain/speechbrain

The open-source model used in this example is trained on the Libri3Mix dataset and can therefore separate audio containing three mixed voices. The open-source model is available at:

https://huggingface.co/speechbrain/sepformer-libri3mix/tree/main

Test audio download address:

https://huggingface.co/speechbrain/sepformer-wsj03mix/tree/main

1.2 Usage Instructions

The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:

  • Board-side example program path: Linux_SDK/sdk/verify/opendla/source/separation
  • Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/separation/separation_lib3mix_sim.img
  • Board-side test audio path: Linux_SDK/sdk/verify/opendla/source/resource/item0_mix.wav

If the user does not need to convert the model, they can directly skip to section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Setting up the Python environment

    $conda create -n speechbrain python==3.9
    $conda activate speechbrain
    $git clone https://github.com/speechbrain/speechbrain.git
    $cd speechbrain
    $pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
    $pip install --editable .
    

    Note: The Python environment setup provided here is for reference only; for the specific setup process, please refer to the official SpeechBrain README:

    https://github.com/speechbrain/speechbrain/blob/develop/README.md
    
  • Model testing

    • Write the model testing script inference.py

      from speechbrain.pretrained import SepformerSeparation as separator
      import torchaudio

      # Download the pretrained SepFormer model from HuggingFace and cache it locally
      model = separator.from_hparams(source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

      # Separate the 3-speaker test mixture; est_sources has shape [batch, time, 3]
      est_sources = model.separate_file(path='speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')

      # Save each estimated source as an 8 kHz WAV file
      torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
      torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)
      torchaudio.save("source3hat.wav", est_sources[:, :, 2].detach().cpu(), 8000)
      
    • Run the model testing script to ensure the speechbrain environment is configured correctly.

      $python inference.py

  • Model export

    • Install dependency libraries

      $pip install onnx -i https://pypi.tuna.tsinghua.edu.cn/simple
      $pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple
      
    • Write the model conversion script export_onnx.py

      import os
      import sys
      sys.path.append(os.getcwd())
      import torch
      import numpy as np
      import torch.nn.functional as F

      from speechbrain.inference.separation import SepformerSeparation as separator
      from speechbrain.utils.fetching import fetch
      from speechbrain.utils.data_utils import split_path
      import torchaudio

      import onnx
      import onnxsim
      import onnxruntime as ort

      model = separator.from_hparams(source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

      class SeparationModel(torch.nn.Module):
          def __init__(self, encoder, masknet, decoder):
              super().__init__()
              self.encoder = encoder
              self.masknet = masknet
              self.decoder = decoder

          def forward(self, wav):
              # Encode the waveform into the latent representation
              source = self.encoder(wav)
              print(source.size())

              # Estimate one mask per speaker
              source_mask = self.masknet(source)
              print(source_mask.size())

              # Apply the three masks to the encoded mixture
              source = torch.stack([source] * 3)
              sep_h = source * source_mask

              # Decoding
              est_source = torch.cat(
                  [
                      self.decoder(sep_h[i]).unsqueeze(-1)
                      for i in range(3)
                  ],
                  dim=-1,
              )
              print(est_source.size())

              # Pad or trim the estimated sources back to the input length
              T_origin = wav.size(1)
              T_est = est_source.size(1)
              if T_origin > T_est:
                  est_source = F.pad(est_source, (0, 0, 0, T_origin - T_est))
              else:
                  est_source = est_source[:, :T_origin, :]

              return est_source

      if __name__ == '__main__':
          source, fl = split_path('speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')
          path = fetch(fl, source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

          batch, fs_file = torchaudio.load(path)
          fs_model = 8000

          # resample the data if needed
          if fs_file != fs_model:
              print(
                  "Resampling the audio from {} Hz to {} Hz".format(
                      fs_file, fs_model
                  )
              )
              tf = torchaudio.transforms.Resample(
                  orig_freq=fs_file, new_freq=fs_model
              )
              batch = batch.mean(dim=0, keepdim=True)
              batch = tf(batch)

          # Zero-pad the audio so its length is a multiple of 8000 samples (1 s at 8 kHz)
          mod = batch.shape[1] % 8000
          batch = torch.cat([batch, torch.zeros((1, 8000 - mod))], dim=1)
          div = batch.shape[1] // 8000
          sub_model = SeparationModel(model.mods.encoder, model.mods.masknet, model.mods.decoder)

          result_list = []
          onnx_path = './opendla/separation_lib3mix.onnx'
          for i in range(div):
              if i == div - 1:
                  input_tensor = batch[:, i*8000:]
              else:
                  input_tensor = batch[:, i*8000:(i+1)*8000]

              est_sources = sub_model(input_tensor)
              result_list.append(est_sources)

              # Export the fixed-length (1 s) sub-model to ONNX
              torch.onnx.export(
                  sub_model,
                  (input_tensor),
                  onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["audio"],
                  output_names=["probs"],
                  verbose=False,
              )

              model_onnx = onnx.load(onnx_path)  # load onnx model
              onnx.checker.check_model(model_onnx)  # check onnx model
              model_onnx, check = onnxsim.simplify(model_onnx)
              onnx.save(model_onnx, onnx_path.replace('separation', 'separation_sim'))

              # Only one fixed-size segment needs to be exported, so stop here
              exit(1)
      
    • Run the model conversion script export_onnx.py.

      $python export_onnx.py
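
    • (Optional) Sanity-check the exported model. The export script imports onnxruntime but does not use it; the following minimal sketch (assuming the simplified model was written to ./opendla/separation_sim_lib3mix.onnx) runs the exported model on a dummy 1-second input and prints the output shape:

      import onnxruntime as ort
      import torch

      # Load the simplified ONNX model written by export_onnx.py
      onnx_path = './opendla/separation_sim_lib3mix.onnx'
      sess = ort.InferenceSession(onnx_path)

      # One second of audio at 8 kHz, matching the exported input "audio"
      dummy = torch.randn(1, 8000)
      ort_out = sess.run(["probs"], {"audio": dummy.numpy()})[0]

      # Expected shape: [1, 8000, 3], one track per separated speaker
      print(ort_out.shape)
      # To compare against PyTorch, run sub_model(dummy) inside export_onnx.py and
      # check that the maximum absolute difference between the two outputs is small.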

2.2 Offline Model Conversion

2.2.1 Pre & Post Processing Instructions

  • Preprocessing

    Before feeding audio into the model, it typically needs to be loaded into a tensor with the torchaudio.load interface. The returned tensor is then truncated or zero-padded to the fixed input length and finally normalized (see the sketch after this list).
  • Postprocessing

    The speech separation model has no postprocessing operations. After obtaining the model output data, a separate audio track can be saved for each speaker according to the number of output channels (see the sketch after this list).
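
  • Pre/postprocessing reference sketch

    The following is a minimal sketch of the steps described above, assuming a fixed 1-second (8000-sample) input at 8 kHz, three output channels, and simple peak normalization; the segment length and the normalization scheme must be adapted to the actual offline model:

    import torch
    import torchaudio

    SEG_LEN = 8000  # assumed fixed model input length: 1 s at 8 kHz

    # Preprocessing: load, mix down, pad/truncate to the fixed length, normalize
    wav, sr = torchaudio.load("item0_mix.wav")   # shape [channels, samples]
    wav = wav.mean(dim=0, keepdim=True)          # mix down to mono
    if wav.shape[1] < SEG_LEN:
        wav = torch.nn.functional.pad(wav, (0, SEG_LEN - wav.shape[1]))
    else:
        wav = wav[:, :SEG_LEN]
    wav = wav / wav.abs().max()                  # peak normalization (assumed scheme)

    # ... run the model on `wav`; a placeholder stands in for its output here ...
    est_sources = torch.zeros(1, SEG_LEN, 3)     # [batch, time, speakers]

    # Postprocessing: save one 8 kHz track per output channel
    for spk in range(est_sources.shape[2]):
        torchaudio.save(f"spk_{spk}.wav", est_sources[:, :, spk].cpu(), 8000)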

2.2.2 Offline Model Conversion Process

Note:

  1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar.
  2) The conversion command needs to be run in the Docker environment; please load the SGS Docker environment first according to the Docker development environment tutorial.

  • Copy the ONNX model to the conversion code directory

    $cp speechbrain/sepformer-libri3mix/separation_sim_lib3mix.onnx OpenDLAModel/separation/sepformer/onnx
    
  • Conversion command

    $cd IPU_SDK_Release/docker
    $bash run_docker.sh
    # Enter the OpenDLAModel directory in the docker environment
    $cd /work/SGS_XXX/OpenDLAModel
    $bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s false
    
  • Final generated model addresses

    output/${chip}_${time}/separation_sim_lib3mix.img
    output/${chip}_${time}/separation_sim_lib3mix_fixed.sim
    output/${chip}_${time}/separation_sim_lib3mix_float.sim
    

2.2.3 Key Script Parameter Analysis

-   input_config.ini
        [INPUT_CONFIG]
        inputs=audio;                               # ONNX input node name, separate with commas if there are multiple;
        training_input_formats=RAWDATA_F32_NHWC;    # Format of model training data
        input_formats=RAWDATA_F32_NHWC;             # Board input format, can choose based on the ONNX input format, e.g., float: RAWDATA_F32_NHWC, int32: RAWDATA_S16_NHWC;
        quantizations=TRUE;                         # Enable input quantization, no need to change;
        [OUTPUT_CONFIG]
        outputs=probs;                              # ONNX output node name, separate with commas if there are multiple;
        dequantizations=FALSE;                      # Whether to enable dequantization; fill in according to actual needs (TRUE is recommended). If set to FALSE, the output is int16; if set to TRUE, the output is float32;
-   separation_sepformer.cfg
        [SEPFORMER]
        CHIP_LIST=pcupid                            # Platform name, must match board platform, otherwise model will not run
        Model_LIST=separation_sim_lib3mix           # Input ONNX model name
        INPUT_SIZE_LIST=0                           # Model input resolution
        INPUT_INI_LIST=input_config.ini             # Configuration file
        CLASS_NUM_LIST=0                            # Just fill in 0
        SAVE_NAME_LIST=separation_sim_lib3mix.img   # Output model name
        QUANT_DATA_PATH=image_lists.txt             # Quantization data path

2.3 Model Simulation

  • Get float/fixed/offline model output

    $bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s true

    After executing the above command, the tensor output of the float model is saved by default to a txt file under separation/sepformer/log/output. The separation/sepformer/convert.sh script also provides simulation examples for the fixed and offline models; users can uncomment those code blocks to obtain the fixed and offline model outputs during the run.
  • Model Accuracy Comparison

    With the same input as above, enter the environment built in section 2.1 and add print(est_sources) after the model.separate_file call in inference.py. This prints the output tensor of the PyTorch model node, which can then be compared with the float, fixed, and offline model outputs (see the sketch below). Note that the output format of the original model is NCHW, while the output format of the float/fixed/offline models is NHWC.
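
    A minimal comparison sketch, assuming the float model tensor was dumped to a txt file under separation/sepformer/log/output (the file name probs.txt and one-value-per-line layout are assumptions; adjust to the toolchain's actual dump format) and that the PyTorch reference tensor was saved beforehand with torch.save:

    import numpy as np
    import torch

    # Float-model output dumped by the simulator (assumed name and layout), NHWC: [N, T, C]
    sim_out = np.loadtxt("separation/sepformer/log/output/probs.txt").reshape(1, 8000, 3)

    # PyTorch reference tensor saved with torch.save (hypothetical file name)
    torch_out = torch.load("est_sources.pt").detach().numpy()
    # The original model output is NCHW while the float model output is NHWC,
    # so transpose before comparing (adjust the axes to the actual shapes)
    torch_out = np.transpose(torch_out, (0, 2, 1))

    print("max abs diff:", np.abs(sim_out - torch_out).max())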

3 Board-Side Deployment

3.1 Program Compilation

Before compiling the example program, you first need to select the defconfig for the SDK full-package build according to the board (nand/nor/emmc, DDR model, etc.). For details, refer to the alkaid SDK sigdoc document "Development Environment Setup."

  • Compile the board-side sepformer example.

    $cd sdk/verify/opendla
    $make clean && make source/separation/sepformer -j8

  • Final generated executable file address

    sdk/verify/opendla/out/${AARCH}/app/prog_separation_sepformer

3.2 Running Files

When running the program, the following files need to be copied to the board:

  • prog_separation_sepformer
  • item0_mix.wav
  • separation_sim_lib3mix.img

3.3 Running Instructions

  • Usage: ./prog_separation_sepformer wav model (execution file usage command)
  • Required Input:

    • wav: audio file
    • model: model file
  • Typical Output:

    ./prog_separation_sepformer item0_mix.wav models/separation_sim_lib3mix.img
    
        client [830] connected, module:ipu
        invoke time: 1064.185000 ms
        invoke time: 2128.087000 ms
        invoke time: 3192.150000 ms
        invoke time: 4255.845000 ms
        invoke time: 5320.448000 ms
        all invoke time: 5320.547000 ms
        WAV file 'spk_0.wav' has been written
        WAV file 'spk_1.wav' has been written
        WAV file 'spk_2.wav' has been written
        ------shutdown IPU0------
        client [830] disconnected, module:ipu
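
  • Result verification (optional)

    The separated tracks written by the board program can be sanity-checked against the PyTorch reference outputs from section 2.1. A minimal sketch, assuming spk_0.wav to spk_2.wav have been copied back from the board and source1hat.wav to source3hat.wav were produced by inference.py; note that the speaker order of the two sets may differ, so the pairs may need to be permuted:

    import torch
    import torchaudio

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR in dB between two mono signals of equal length."""
        est = est - est.mean()
        ref = ref - ref.mean()
        proj = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
        noise = est - proj
        return 10 * torch.log10(proj.pow(2).sum() / (noise.pow(2).sum() + eps))

    for i in range(3):
        board, _ = torchaudio.load(f"spk_{i}.wav")         # track copied back from the board
        ref, _ = torchaudio.load(f"source{i + 1}hat.wav")  # PyTorch reference from inference.py
        n = min(board.shape[1], ref.shape[1])
        print(f"speaker {i}: SI-SNR = {si_snr(board[0, :n], ref[0, :n]):.2f} dB")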