SepFormer
1 Overview
1.1 Background Introduction
Speech separation refers to extracting the individual speech sources from a mixed audio signal; for example, separating each speaker's voice from a multi-person conference recording. The speech separation algorithm used here comes from SpeechBrain, an open-source PyTorch toolkit that integrates a variety of advanced audio processing techniques. For more details, please refer to the official SpeechBrain repository:
https://github.com/speechbrain/speechbrain
The open-source model used in this example is trained on the Libri3Mix dataset and can therefore separate a mixture of three voices. The model is available at:
https://huggingface.co/speechbrain/sepformer-libri3mix/tree/main
Test audio download address:
https://huggingface.co/speechbrain/sepformer-wsj03mix/tree/main
1.2 Usage Instructions
The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:
- Board-side example program path: Linux_SDK/sdk/verify/opendla/source/separation
- Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/separation/separation_lib3mix_sim.img
- Board-side test audio path: Linux_SDK/sdk/verify/opendla/source/resource/item0_mix.wav
If the user does not need to convert the model, they can directly skip to section 3.
2 Model Conversion
2.1 ONNX Model Conversion
- Setting up the Python environment

  ```
  $ conda create -n speechbrain python==3.9
  $ conda activate speechbrain
  $ git clone https://github.com/speechbrain/speechbrain.git
  $ cd speechbrain
  $ pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
  $ pip install --editable .
  ```

  Note: the Python environment setup given here is for reference only; for the exact setup process, please refer to the official source-code tutorial:
  https://github.com/speechbrain/speechbrain/blob/develop/README.md

- Model testing

  - Write the model testing script inference.py:

    ```python
    from speechbrain.pretrained import SepformerSeparation as separator
    import torchaudio

    model = separator.from_hparams(source="speechbrain/sepformer-libri3mix",
                                   savedir='speechbrain/sepformer-libri3mix')
    est_sources = model.separate_file(path='speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')

    torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
    torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)
    torchaudio.save("source3hat.wav", est_sources[:, :, 2].detach().cpu(), 8000)
    ```

  - Run the model testing script to verify that the speechbrain environment is configured correctly:

    ```
    $ python inference.py
    ```
- Model export

  - Install dependency libraries

    ```
    $ pip install onnx -i https://pypi.tuna.tsinghua.edu.cn/simple
    $ pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple
    ```

  - Write the model conversion script export_onnx.py:

    ```python
    import os
    import sys
    sys.path.append(os.getcwd())

    import torch
    import torch.nn.functional as F  # needed for F.pad in SeparationModel.forward
    import numpy as np
    import torchaudio
    import onnx
    import onnxsim
    import onnxruntime as ort
    from speechbrain.inference.separation import SepformerSeparation as separator
    from speechbrain.utils.fetching import fetch
    from speechbrain.utils.data_utils import split_path

    model = separator.from_hparams(source="speechbrain/sepformer-libri3mix",
                                   savedir='speechbrain/sepformer-libri3mix')

    class SeparationModel(torch.nn.Module):
        """Encoder + masknet + decoder sub-model used for ONNX export."""

        def __init__(self, encoder, masknet, decoder):
            super().__init__()
            self.encoder = encoder
            self.masknet = masknet
            self.decoder = decoder

        def forward(self, wav):
            source = self.encoder(wav)
            print(source.size())
            source_mask = self.masknet(source)
            print(source_mask.size())
            source = torch.stack([source] * 3)
            sep_h = source * source_mask
            # Decoding: one decoder pass per separated source
            est_source = torch.cat(
                [self.decoder(sep_h[i]).unsqueeze(-1) for i in range(3)],
                dim=-1,
            )
            print(est_source.size())
            # Pad or truncate so the output length matches the input length
            T_origin = wav.size(1)
            T_est = est_source.size(1)
            if T_origin > T_est:
                est_source = F.pad(est_source, (0, 0, 0, T_origin - T_est))
            else:
                est_source = est_source[:, :T_origin, :]
            return est_source

    if __name__ == '__main__':
        source, fl = split_path('speechbrain/sepformer-wsj03mix/test_mixture_3spks.wav')
        path = fetch(fl, source="speechbrain/sepformer-libri3mix",
                     savedir='speechbrain/sepformer-libri3mix')
        batch, fs_file = torchaudio.load(path)
        fs_model = 8000

        # Resample the data if needed
        if fs_file != fs_model:
            print("Resampling the audio from {} Hz to {} Hz".format(fs_file, fs_model))
            tf = torchaudio.transforms.Resample(orig_freq=fs_file, new_freq=fs_model)
            batch = batch.mean(dim=0, keepdim=True)
            batch = tf(batch)

        # Zero-pad so the length is a multiple of 8000 samples (1 s chunks);
        # the guard avoids appending a full extra second when it already is
        mod = batch.shape[1] % 8000
        if mod != 0:
            batch = torch.cat([batch, torch.zeros((1, 8000 - mod))], dim=1)
        div = batch.shape[1] // 8000

        sub_model = SeparationModel(model.mods.encoder, model.mods.masknet, model.mods.decoder)
        result_list = []
        onnx_path = './opendla/separation_lib3mix.onnx'

        # Run the sub-model chunk by chunk to check it works end to end
        for i in range(div):
            if i == div - 1:
                input_tensor = batch[:, i * 8000:]
            else:
                input_tensor = batch[:, i * 8000:(i + 1) * 8000]
            est_sources = sub_model(input_tensor)
            result_list.append(est_sources)

        # Export the sub-model with a fixed 8000-sample input
        torch.onnx.export(
            sub_model,
            (input_tensor,),
            onnx_path,
            export_params=True,
            opset_version=13,
            do_constant_folding=True,
            input_names=["audio"],
            output_names=["probs"],
            verbose=False,
        )

        model_onnx = onnx.load(onnx_path)     # load onnx model
        onnx.checker.check_model(model_onnx)  # check onnx model
        model_onnx, check = onnxsim.simplify(model_onnx)
        onnx.save(model_onnx, onnx_path.replace('separation', 'separation_sim'))
    ```

  - Run the model conversion script export_onnx.py:

    ```
    $ python export_onnx.py
    ```
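If you want to sanity-check the exported model before offline conversion, onnxruntime (already imported in export_onnx.py) can run the simplified ONNX directly. A minimal sketch, assuming the simplified model path and the "audio" input name used by the script above:

```python
# Quick sanity check of the simplified ONNX model with onnxruntime.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("./opendla/separation_sim_lib3mix.onnx")
dummy = np.random.randn(1, 8000).astype(np.float32)  # one 1 s chunk at 8 kHz
(probs,) = sess.run(None, {"audio": dummy})
print(probs.shape)  # expected (1, 8000, 3): one channel per separated speaker
```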
2.2 Offline Model Conversion
2.2.1 Pre & Post Processing Instructions
- Preprocessing

  Before feeding audio into the model, it typically needs to be loaded into a tensor suitable for model input via the torchaudio.load interface. The returned tensor is then truncated or zero-padded to a fixed input length, and finally normalized (a minimal sketch of the pre- and post-processing follows this list). The input information for the model is shown below:

  
- Postprocessing

  The speech separation model has no postprocessing operations. After obtaining the model output, a separate audio track can be saved for each speaker according to the number of output channels. The output information for the model is shown below:

  
2.2.2 Offline Model Conversion Process
Note:
1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar.
2) The conversion command must be run in the Docker environment; first load the SGS Docker environment according to the Docker development environment tutorial.
- Copy the ONNX model to the conversion code directory

  ```
  $ cp speechbrain/sepformer-libri3mix/separation_sim_lib3mix.onnx OpenDLAModel/separation/sepformer/onnx
  ```

- Conversion command

  ```
  $ cd IPU_SDK_Release/docker
  $ bash run_docker.sh
  # Enter the OpenDLAModel directory in the docker environment
  $ cd /work/SGS_XXX/OpenDLAModel
  $ bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s false
  ```

- Final generated model addresses

  ```
  output/${chip}_${time}/separation_sim_lib3mix.img
  output/${chip}_${time}/separation_sim_lib3mix_fixed.sim
  output/${chip}_${time}/separation_sim_lib3mix_float.sim
  ```
2.2.3 Key Script Parameter Analysis
- input_config.ini

  ```ini
  [INPUT_CONFIG]
  inputs=audio;                             # ONNX input node names; separate with commas if there are multiple
  training_input_formats=RAWDATA_F32_NHWC;  # Format of the model training data
  input_formats=RAWDATA_F32_NHWC;           # Board-side input format, chosen to match the ONNX input format, e.g. float: RAWDATA_F32_NHWC, int16: RAWDATA_S16_NHWC
  quantizations=TRUE;                       # Enable input quantization; no need to change

  [OUTPUT_CONFIG]
  outputs=probs;                            # ONNX output node names; separate with commas if there are multiple
  dequantizations=FALSE;                    # Whether to enable dequantization; fill in according to actual needs (TRUE is recommended). FALSE: output is int16; TRUE: output is float32
  ```
- separation_sepformer.cfg

  ```ini
  [SEPFORMER]
  CHIP_LIST=pcupid                           # Platform name; must match the board platform, otherwise the model will not run
  Model_LIST=separation_sim_lib3mix          # Input ONNX model name
  INPUT_SIZE_LIST=0                          # Model input resolution
  INPUT_INI_LIST=input_config.ini            # Configuration file
  CLASS_NUM_LIST=0                           # Just fill in 0
  SAVE_NAME_LIST=separation_sim_lib3mix.img  # Output model name
  QUANT_DATA_PATH=image_lists.txt            # Quantization data path
  ```
2.3 Model Simulation

- Get float/fixed/offline model output

  ```
  $ bash convert.sh -a separation/sepformer -c config/separation_sepformer.cfg -p SGS_IPU_Toolchain (absolute path) -s true
  ```

  After executing the above command, the tensor output of the float model is saved by default to a txt file under separation/sepformer/log/output. Additionally, the separation/sepformer/convert.sh script provides simulation examples for the fixed and offline models; users can uncomment the corresponding code blocks to obtain the fixed and offline model outputs at runtime.

- Model Accuracy Comparison
  Using the same input as above, enter the environment built in section 2.1 and, in the inference.py file, add a print statement after line 8:

  ```
  print(est_sources)
  ```

  This yields the output tensor of the corresponding PyTorch model node, which can then be compared against the float, fixed, and offline model outputs. Note that the output format of the original model is NCHW, while the output format of the float/fixed/offline models is NHWC.
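For the numeric comparison itself, a minimal sketch is shown below; the dump file names and the exact shapes are illustrative assumptions, not fixed by the toolchain:

```python
# Compare the PyTorch output (NCHW) against a float-model dump (NHWC).
import numpy as np

torch_out = np.load("est_sources_pytorch.npy")  # illustrative 4-D NCHW dump of est_sources
sim_out = np.loadtxt("float_output.txt")        # flat txt dump from separation/sepformer/log/output
sim_out = sim_out.reshape(torch_out.shape[0],   # rebuild the dump as NHWC
                          torch_out.shape[2],
                          torch_out.shape[3],
                          torch_out.shape[1])

# Transpose the PyTorch output to NHWC before diffing the two tensors.
diff = np.abs(torch_out.transpose(0, 2, 3, 1) - sim_out)
print("max abs diff:", diff.max())
```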
3 Board-Side Deployment
3.1 Program Compilation

Before compiling the example program, first select the defconfig for the SDK full-package build according to the board (nand/nor/emmc, DDR model, etc.); see the alkaid SDK sigdoc document "Development Environment Setup".
- Compile the board-side sepformer example:

  ```
  $ cd sdk/verify/opendla
  $ make clean && make source/separation/sepformer -j8
  ```

- Final generated executable file address:

  ```
  sdk/verify/opendla/out/${AARCH}/app/prog_separation_sepformer
  ```
3.2 Running Files

When running the program, the following files need to be copied to the board:

- prog_separation_sepformer
- item0_mix.wav
- separation_sim_lib3mix.img
3.3 Running Instructions

- Usage (executable usage command):

  ```
  ./prog_separation_sepformer wav model
  ```

- Required Input:

  - wav: audio file
  - model: model file

- Typical Output:

  ```
  ./prog_separation_sepformer item0_mix.wav models/separation_sim_lib3mix.img
  client [830] connected, module:ipu
  invoke time: 1064.185000 ms
  invoke time: 2128.087000 ms
  invoke time: 3192.150000 ms
  invoke time: 4255.845000 ms
  invoke time: 5320.448000 ms
  all invoke time: 5320.547000 ms
  WAV file 'spk_0.wav' has been written
  WAV file 'spk_1.wav' has been written
  WAV file 'spk_2.wav' has been written
  ------shutdown IPU0------
  client [830] disconnected, module:ipu
  ```