
SepFormer

1 Overview

1.1 Background Introduction

Speech separation refers to extracting the individual speech sources from a mixed audio signal, for example separating each speaker's voice from a multi-person conference recording. The speech separation algorithm used here comes from SpeechBrain, an open-source PyTorch-based toolkit that integrates a variety of advanced audio processing techniques. For more details, please refer to the official SpeechBrain repository:

https://github.com/speechbrain/speechbrain

The open-source model used in this example is trained on the Libri3Mix dataset and can therefore separate audio containing three mixed voices. The open-source model is available at:

https://huggingface.co/speechbrain/sepformer-libri3mix/tree/main

Test audio download address:

https://huggingface.co/speechbrain/sepformer-wsj03mix/tree/main

1.2 Usage Instructions

The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:

  • Board-side example program path: Linux_SDK/sdk/verify/opendla/source/separation
  • Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/separation/separation_lib3mix_sim.img
  • Board-side test audio path: Linux_SDK/sdk/verify/opendla/source/resource/item0_mix.wav

If the user does not need to convert the model, they can directly skip to section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Setting up the Python environment

    $conda create -n speechbrain python==3.9
    $conda activate speechbrain
    $git clone https://github.com/speechbrain/speechbrain.git
    $cd speechbrain
    $pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
    $pip install --editable .
    

    Note: The Python environment setup provided here is for reference only; for the specific setup process, please refer to the official SpeechBrain README:

    https://github.com/speechbrain/speechbrain/blob/develop/README.md
    
  • Model testing

    • Write the model testing script inference.py

      from speechbrain.pretrained import SepformerSeparation as separator
      import torchaudio

      # Download the pretrained SepFormer model from HuggingFace and cache it locally
      model = separator.from_hparams(source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

      # Separate the 3-speaker test mixture; est_sources has shape [batch, time, 3]
      est_sources = model.separate_file(path='speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')

      # Save each estimated source as an 8 kHz WAV file
      torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
      torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)
      torchaudio.save("source3hat.wav", est_sources[:, :, 2].detach().cpu(), 8000)
      
    • Run the model testing script to ensure the speechbrain environment is configured correctly.

      $python inference.py

  • Model export

    • Install dependency libraries

      $pip install onnx -i https://pypi.tuna.tsinghua.edu.cn/simple
      $pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple
      
    • Write the model conversion script export_onnx.py

      import os
      import sys
      sys.path.append(os.getcwd())
      import torch
      import numpy as np
      import torch.nn.functional as F

      from speechbrain.inference.separation import SepformerSeparation as separator
      from speechbrain.utils.fetching import fetch
      from speechbrain.utils.data_utils import split_path
      import torchaudio

      import onnx
      import onnxsim
      import onnxruntime as ort

      model = separator.from_hparams(source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

      class SeparationModel(torch.nn.Module):
          def __init__(self, encoder, masknet, decoder):
              super().__init__()
              self.encoder = encoder
              self.masknet = masknet
              self.decoder = decoder

          def forward(self, wav):
              # Encode the waveform into the latent representation
              source = self.encoder(wav)
              print(source.size())

              # Estimate one mask per speaker
              source_mask = self.masknet(source)
              print(source_mask.size())

              # Apply the three masks to the encoded mixture
              source = torch.stack([source] * 3)
              sep_h = source * source_mask

              # Decoding
              est_source = torch.cat(
                  [
                      self.decoder(sep_h[i]).unsqueeze(-1)
                      for i in range(3)
                  ],
                  dim=-1,
              )
              print(est_source.size())

              # Pad or trim the estimated sources back to the input length
              T_origin = wav.size(1)
              T_est = est_source.size(1)
              if T_origin > T_est:
                  est_source = F.pad(est_source, (0, 0, 0, T_origin - T_est))
              else:
                  est_source = est_source[:, :T_origin, :]

              return est_source

      if __name__ == '__main__':
          source, fl = split_path('speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')
          path = fetch(fl, source="speechbrain/sepformer-libri3mix", savedir='speechbrain/sepformer-libri3mix')

          batch, fs_file = torchaudio.load(path)
          fs_model = 8000

          # resample the data if needed
          if fs_file != fs_model:
              print(
                  "Resampling the audio from {} Hz to {} Hz".format(
                      fs_file, fs_model
                  )
              )
              tf = torchaudio.transforms.Resample(
                  orig_freq=fs_file, new_freq=fs_model
              )
              batch = batch.mean(dim=0, keepdim=True)
              batch = tf(batch)

          # Zero-pad the audio so its length is a multiple of 8000 samples (1 s at 8 kHz)
          mod = batch.shape[1] % 8000
          batch = torch.cat([batch, torch.zeros((1, 8000 - mod))], dim=1)
          div = batch.shape[1] // 8000
          sub_model = SeparationModel(model.mods.encoder, model.mods.masknet, model.mods.decoder)

          result_list = []
          onnx_path = './opendla/separation_lib3mix.onnx'
          for i in range(div):
              if i == div - 1:
                  input_tensor = batch[:, i*8000:]
              else:
                  input_tensor = batch[:, i*8000:(i+1)*8000]

              est_sources = sub_model(input_tensor)
              result_list.append(est_sources)

              # Export the fixed-length (1 s) sub-model to ONNX
              torch.onnx.export(
                  sub_model,
                  (input_tensor),
                  onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["audio"],
                  output_names=["probs"],
                  verbose=False,
              )

              model_onnx = onnx.load(onnx_path)  # load onnx model
              onnx.checker.check_model(model_onnx)  # check onnx model
              model_onnx, check = onnxsim.simplify(model_onnx)
              onnx.save(model_onnx, onnx_path.replace('separation', 'separation_sim'))

              # Only one fixed-size segment needs to be exported, so stop here
              exit(1)
      
    • Run the model conversion script export_onnx.py.

      $python export_onnx.py
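
    • (Optional) Sanity-check the exported model. The export script imports onnxruntime but does not use it; the following minimal sketch (assuming the simplified model was written to ./opendla/separation_sim_lib3mix.onnx) runs the exported model on a dummy 1-second input and prints the output shape:

      import onnxruntime as ort
      import torch

      # Load the simplified ONNX model written by export_onnx.py
      onnx_path = './opendla/separation_sim_lib3mix.onnx'
      sess = ort.InferenceSession(onnx_path)

      # One second of audio at 8 kHz, matching the exported input "audio"
      dummy = torch.randn(1, 8000)
      ort_out = sess.run(["probs"], {"audio": dummy.numpy()})[0]

      # Expected shape: [1, 8000, 3], one track per separated speaker
      print(ort_out.shape)
      # To compare against PyTorch, run sub_model(dummy) inside export_onnx.py and
      # check that the maximum absolute difference between the two outputs is small.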

2.2 Offline Model Conversion

2.2.1 Pre & Post Processing Instructions

  • Preprocessing

    Before feeding audio into the model, it typically needs to be loaded into a tensor with the torchaudio.load interface. The returned tensor is then truncated or zero-padded to the fixed input length and finally normalized (see the sketch after this list).
  • Postprocessing

    The speech separation model has no postprocessing operations. After obtaining the model output data, a separate audio track can be saved for each speaker according to the number of output channels (see the sketch after this list).
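
  • Pre/postprocessing reference sketch

    The following is a minimal sketch of the steps described above, assuming a fixed 1-second (8000-sample) input at 8 kHz, three output channels, and simple peak normalization; the segment length and the normalization scheme must be adapted to the actual offline model:

    import torch
    import torchaudio

    SEG_LEN = 8000  # assumed fixed model input length: 1 s at 8 kHz

    # Preprocessing: load, mix down, pad/truncate to the fixed length, normalize
    wav, sr = torchaudio.load("item0_mix.wav")   # shape [channels, samples]
    wav = wav.mean(dim=0, keepdim=True)          # mix down to mono
    if wav.shape[1] < SEG_LEN:
        wav = torch.nn.functional.pad(wav, (0, SEG_LEN - wav.shape[1]))
    else:
        wav = wav[:, :SEG_LEN]
    wav = wav / wav.abs().max()                  # peak normalization (assumed scheme)

    # ... run the model on `wav`; a placeholder stands in for its output here ...
    est_sources = torch.zeros(1, SEG_LEN, 3)     # [batch, time, speakers]

    # Postprocessing: save one 8 kHz track per output channel
    for spk in range(est_sources.shape[2]):
        torchaudio.save(f"spk_{spk}.wav", est_sources[:, :, spk].cpu(), 8000)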

2.2.2 Offline Model Conversion Process

Note:

  1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar.
  2) The conversion command needs to be run in the Docker environment; please load the SGS Docker environment first according to the Docker development environment tutorial.

  • Copy the ONNX model to the conversion code directory

    $cp speechbrain/sepformer-libri3mix/separation_sim_lib3mix.onnx OpenDLAModel/separation/sepformer/onnx
    
  • Conversion command

    $cd IPU_SDK_Release/docker
    $bash run_docker.sh
    # Enter the OpenDLAModel directory in the docker environment
    $cd /work/SGS_XXX/OpenDLAModel
    $bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s false
    
  • Final generated model addresses

    output/${chip}_${time}/separation_sim_lib3mix.img
    output/${chip}_${time}/separation_sim_lib3mix_fixed.sim
    output/${chip}_${time}/separation_sim_lib3mix_float.sim
    

2.2.3 Key Script Parameter Analysis

-   input_config.ini
        [INPUT_CONFIG]
        inputs=audio;                               # ONNX input node name, separate with commas if there are multiple;
        training_input_formats=RAWDATA_F32_NHWC;    # Format of model training data
        input_formats=RAWDATA_F32_NHWC;             # Board input format, can choose based on the ONNX input format, e.g., float: RAWDATA_F32_NHWC, int32: RAWDATA_S16_NHWC;
        quantizations=TRUE;                         # Enable input quantization, no need to change;
        [OUTPUT_CONFIG]
        outputs=probs;                              # ONNX output node name, separate with commas if there are multiple;
        dequantizations=FALSE;                      # Whether to enable dequantization; fill in according to actual needs (TRUE is recommended). If set to FALSE, the output is int16; if set to TRUE, the output is float32;
-   separation_sepformer.cfg
        [SEPFORMER]
        CHIP_LIST=pcupid                            # Platform name, must match board platform, otherwise model will not run
        Model_LIST=separation_sim_lib3mix           # Input ONNX model name
        INPUT_SIZE_LIST=0                           # Model input resolution
        INPUT_INI_LIST=input_config.ini             # Configuration file
        CLASS_NUM_LIST=0                            # Just fill in 0
        SAVE_NAME_LIST=separation_sim_lib3mix.img   # Output model name
        QUANT_DATA_PATH=image_lists.txt             # Quantization data path

2.3 Model Simulation

  • Get float/fixed/offline model output

    $bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s true

    After executing the above command, the tensor output of the float model is saved by default to a txt file under separation/sepformer/log/output. The separation/sepformer/convert.sh script also provides simulation examples for the fixed and offline models; users can uncomment those code blocks to obtain the fixed and offline model outputs during the run.
  • Model Accuracy Comparison

    With the same input as above, enter the environment built in section 2.1 and add print(est_sources) after the model.separate_file call in inference.py. This prints the output tensor of the PyTorch model node, which can then be compared with the float, fixed, and offline model outputs (see the sketch below). Note that the output format of the original model is NCHW, while the output format of the float/fixed/offline models is NHWC.
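
    A minimal comparison sketch, assuming the float model tensor was dumped to a txt file under separation/sepformer/log/output (the file name probs.txt and one-value-per-line layout are assumptions; adjust to the toolchain's actual dump format) and that the PyTorch reference tensor was saved beforehand with torch.save:

    import numpy as np
    import torch

    # Float-model output dumped by the simulator (assumed name and layout), NHWC: [N, T, C]
    sim_out = np.loadtxt("separation/sepformer/log/output/probs.txt").reshape(1, 8000, 3)

    # PyTorch reference tensor saved with torch.save (hypothetical file name)
    torch_out = torch.load("est_sources.pt").detach().numpy()
    # The original model output is NCHW while the float model output is NHWC,
    # so transpose before comparing (adjust the axes to the actual shapes)
    torch_out = np.transpose(torch_out, (0, 2, 1))

    print("max abs diff:", np.abs(sim_out - torch_out).max())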

3 Board-Side Deployment

3.1 Program Compilation

Before compiling the example program, you first need to select the defconfig for the SDK full-package build according to the board (nand/nor/emmc, DDR model, etc.). For details, refer to the alkaid SDK sigdoc document "Development Environment Setup."

  • Compile the board-side sepformer example.

    $cd sdk/verify/opendla
    $make clean && make source/separation/sepformer -j8

  • Final generated executable file address

    sdk/verify/opendla/out/${AARCH}/app/prog_separation_sepformer

3.2 Running Files

When running the program, the following files need to be copied to the board:

  • prog_separation_sepformer
  • item0_mix.wav
  • separation_sim_lib3mix.img

3.3 Running Instructions

  • Usage: ./prog_separation_sepformer wav model (execution file usage command)
  • Required Input:

    • wav: audio file
    • model: model file
  • Typical Output:

    ./prog_separation_sepformer item0_mix.wav models/separation_sim_lib3mix.img
    
        client [830] connected, module:ipu
        invoke time: 1064.185000 ms
        invoke time: 2128.087000 ms
        invoke time: 3192.150000 ms
        invoke time: 4255.845000 ms
        invoke time: 5320.448000 ms
        all invoke time: 5320.547000 ms
        WAV file 'spk_0.wav' has been written
        WAV file 'spk_1.wav' has been written
        WAV file 'spk_2.wav' has been written
        ------shutdown IPU0------
        client [830] disconnected, module:ipu
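
  • Result verification (optional)

    The separated tracks written by the board program can be sanity-checked against the PyTorch reference outputs from section 2.1. A minimal sketch, assuming spk_0.wav to spk_2.wav have been copied back from the board and source1hat.wav to source3hat.wav were produced by inference.py; note that the speaker order of the two sets may differ, so the pairs may need to be permuted:

    import torch
    import torchaudio

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR in dB between two mono signals of equal length."""
        est = est - est.mean()
        ref = ref - ref.mean()
        proj = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
        noise = est - proj
        return 10 * torch.log10(proj.pow(2).sum() / (noise.pow(2).sum() + eps))

    for i in range(3):
        board, _ = torchaudio.load(f"spk_{i}.wav")         # track copied back from the board
        ref, _ = torchaudio.load(f"source{i + 1}hat.wav")  # PyTorch reference from inference.py
        n = min(board.shape[1], ref.shape[1])
        print(f"speaker {i}: SI-SNR = {si_snr(board[0, :n], ref[0, :n]):.2f} dB")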