VAD

1 Overview

1.1 Background Introduction

VAD (Voice Activity Detection) is an algorithm that detects human speech segments in audio and serves as a front-end module for speech recognition and speaker identification tasks. This document deploys NVIDIA's open-source NeMo VAD model. For detailed information about the model, please visit:

https://github.com/NVIDIA/NeMo/tree/v1.20.0

The model download address is:

https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_marblenet/files

1.2 Usage Instructions

The Linux SDK-alkaid comes with pre-converted offline models and board-side examples by default. The relevant file paths are as follows:

  • Board-side example program path: Linux_SDK/sdk/verify/opendla/source/vad/nemo
  • Board-side offline model path: Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/vad/vad_sim.img
  • Board-side test audio path: Linux_SDK/sdk/verify/opendla/source/resource/BAC009S0764W0121.wav

If you do not need to convert the model, you can skip directly to Section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Setting up the Python environment

        $ conda create --name nemo python==3.10.12
        $ conda activate nemo
        $ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
        $ pip install nemo_toolkit['all']

    Note: Development here is based on NeMo v1.20.0. The Python environment setup above is for reference only; for the exact setup process, please refer to the official source code tutorial:

    https://github.com/NVIDIA/NeMo/tree/v1.20.0
    
  • Model testing

    Run the inference script to confirm that the nemo environment is configured correctly.

        $ cd NeMo
        $ python ./examples/asr/speech_classification/frame_vad_infer.py \
            --config-path="./examples/asr/conf/vad" \
            --config-name="frame_vad_infer_postprocess.yaml" \

  • Model export

    • Write the model conversion script frame_vad_infer.py
      • In line 104 of examples/asr/speech_classification/frame_vad_infer.py, add:

            vad_model.export(
                './vad.onnx',
                dynamic_axes={},
                input_example=[torch.rand((1, 400, 80)).cuda(), {"length": 400}])
      • In line 189 of collections/asr/models/asr_model.py, add:

            if isinstance(input, list):
                tmp = input
                input = tmp[0]
                length = tmp[1]
            if input.shape[2] == 80 and input.shape[1] == length:
                input = torch.transpose(input, 1, 2)
      • In line 204 of collections/asr/models/asr_model.py, modify:
        • Original: ret = dec_fun(encoder_states=encoder_output)
        • Modified: ret = dec_fun(hidden_states=encoder_output)
    • Run the model conversion script frame_vad_infer.py.

          $ python ./examples/asr/speech_classification/frame_vad_infer.py \
              --config-path="./examples/asr/conf/vad" \
              --config-name="frame_vad_infer_postprocess.yaml" \

2.2 Offline Model Conversion

2.2.1 Pre & Post Processing Instructions

  • Preprocessing: Before speech is fed to the model, the WAV audio must be converted to fbank (filter-bank) features. The input information for the converted vad_sim.onnx model is shown in the image below; the required fbank shape is (1, 400, 80), where 400 is the time-series length and 80 is the number of channels.
  • Postprocessing: This model needs no dedicated postprocessing operations. After obtaining the output features, apply softmax to get the predicted probability that each input frame contains valid speech. The output information is shown below:
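
The pre- and post-processing steps described above can be sketched in plain Python. This is a minimal illustration rather than the SDK's implementation: the helper names `fit_frames` and `speech_probabilities` are hypothetical, zero-padding short inputs is an assumption, and treating output index 1 as the speech class is an assumption that should be checked against the model's label order.

```python
import math

NUM_FRAMES = 400   # time-series length expected by the model
NUM_BINS = 80      # fbank channels per frame


def fit_frames(fbank, num_frames=NUM_FRAMES, num_bins=NUM_BINS):
    """Trim or zero-pad a list of fbank frames to exactly num_frames rows.

    Zero-padding is an assumption made for this sketch.
    """
    for frame in fbank:
        if len(frame) != num_bins:
            raise ValueError(f"expected {num_bins} bins, got {len(frame)}")
    fitted = [list(frame) for frame in fbank[:num_frames]]
    while len(fitted) < num_frames:
        fitted.append([0.0] * num_bins)
    return fitted


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def speech_probabilities(outputs, speech_index=1):
    """Map per-frame logits (N x 2) to per-frame speech probabilities.

    speech_index=1 is an assumption about the label order.
    """
    return [softmax(row)[speech_index] for row in outputs]
```

Subtracting the per-row maximum before exponentiating leaves the softmax result unchanged while avoiding overflow for large logits.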

2.2.2 Offline Model Conversion Process

Note: 1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar. 2) The conversion command needs to be run in the Docker environment; please load the SGS Docker environment according to the Docker development environment tutorial first.

  • Copy the ONNX model to the conversion code directory

        $ cp vad_sim.onnx OpenDLAModel/vad/nemo/onnx

  • Conversion command

        $ cd IPU_SDK_Release/docker
        $ bash run_docker.sh
        # Enter the OpenDLAModel directory in the Docker environment
        $ cd /work/SGS_XXX/OpenDLAModel
        $ bash convert.sh -a vad/nemo -c config/vad_nemo.cfg -p SGS_IPU_Toolchain (absolute path) -s false

  • Final generated model paths

        output/{chip}_/vad_sim.img
        output/{chip}_/vad_sim_fixed.sim
        output/{chip}_/vad_sim_float.sim

2.2.3 Key Script Parameter Analysis

-   input_config.ini
        [INPUT_CONFIG]
        inputs=audio_signal;                # ONNX input node name, separate with commas if there are multiple;
        input_formats=RAWDATA_F32_NHWC;     # Board input format, can choose based on the ONNX input format, e.g., float: RAWDATA_F32_NHWC, int32: RAWDATA_S16_NHWC;
        quantizations=TRUE;                 # Enable input quantization, no need to change;
        [OUTPUT_CONFIG]
        outputs=outputs;                    # ONNX output node name, separate with commas if there are multiple;
        dequantizations=TRUE;               # Whether to enable dequantization, fill according to actual needs, recommended to be TRUE. Set to False, output will be int16; set to True, output will be float32
-   vad_nemo.cfg
        [COMFORMER]
        CHIP_LIST=pcupid                    # Platform name, must match board platform, otherwise model will not run
        Model_LIST=vad_sim                  # Input ONNX model name
        INPUT_SIZE_LIST=0                   # Model input resolution
        INPUT_INI_LIST=input_config.ini     # Configuration file
        CLASS_NUM_LIST=0                    # Just fill in 0
        SAVE_NAME_LIST=vad_sim.img          # Output model name
        QUANT_DATA_PATH=image_lists.txt     # Quantization data path
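
Before launching a conversion, it can help to sanity-check the input/output configuration. The sketch below parses text laid out like input_config.ini with Python's standard `configparser`. The helper names `load_io_config` and `check_io_config` are hypothetical, and the rule that `input_formats` and `quantizations` carry one entry per input is an assumption of this sketch, not a documented requirement of the conversion tool.

```python
import configparser

# Sample text mirroring the structure of input_config.ini.
SAMPLE_INI = """\
[INPUT_CONFIG]
inputs=audio_signal;                # ONNX input node name
input_formats=RAWDATA_F32_NHWC;     # Board input format
quantizations=TRUE;
[OUTPUT_CONFIG]
outputs=outputs;
dequantizations=TRUE;
"""


def load_io_config(text):
    """Parse input_config.ini-style text into {section: {key: [values]}}."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None
    )
    parser.read_string(text)
    config = {}
    for section in parser.sections():
        config[section] = {}
        for key, raw in parser.items(section):
            # Drop the trailing ';' and split comma-separated entries.
            values = [v.strip() for v in raw.rstrip(";").split(",")]
            config[section][key] = [v for v in values if v]
    return config


def check_io_config(config):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    inputs = config.get("INPUT_CONFIG", {})
    n_inputs = len(inputs.get("inputs", []))
    if n_inputs == 0:
        problems.append("no input node names listed")
    # Assumption: one format and one quantization flag per input node.
    for key in ("input_formats", "quantizations"):
        if len(inputs.get(key, [])) != n_inputs:
            problems.append(f"{key} must have one entry per input")
    if not config.get("OUTPUT_CONFIG", {}).get("outputs"):
        problems.append("no output node names listed")
    return problems
```

Calling `check_io_config(load_io_config(SAMPLE_INI))` returns an empty list for the sample text; a non-empty list names each check that failed.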

2.3 Model Simulation

  • Obtain float/fixed/offline model outputs

        $ bash convert.sh -a vad/nemo -c config/vad_nemo.cfg -p SGS_IPU_Toolchain (absolute path) -s true

    After the command above runs, the float model's output tensor is saved by default to a txt file under vad/nemo/log/output. The vad/nemo/convert.sh script also contains simulation examples for the fixed and offline models; uncomment the corresponding code blocks before running to obtain the outputs of those models.
  • Model accuracy comparison

    Using the same input as for the models above, enter the environment built in Section 2.1 and add the following print statement at line 447 of examples/asr/asr_vad/speech_to_text_with_vad.py:

        print(log_probs)

    This prints the output tensor of the corresponding node in the PyTorch model so that it can be compared with the outputs of the float, fixed, and offline models. Note in particular that the original model's output format is NCHW, whereas the output format of the float/fixed/offline models is NHWC.
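
The comparison above can be sketched as follows. The code assumes both tensors have already been loaded as flat lists of floats (the layout of the txt output file is not described here, so no parser is shown). The helper names are hypothetical, and cosine similarity and maximum absolute difference are suggested metrics, not ones prescribed by the SDK.

```python
import math


def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat row-major NCHW tensor into flat NHWC order."""
    if len(flat) != n * c * h * w:
        raise ValueError("shape does not match the number of elements")
    out = []
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    out.append(flat[((ni * c + ci) * h + hi) * w + wi])
    return out


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))
```

Reorder the PyTorch output with `nchw_to_nhwc` first, then compare it element by element with the float, fixed, or offline output; a cosine similarity close to 1 and a small maximum difference suggest the conversion preserved the model's behavior.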

3 Board-side Deployment

3.1 Program Compilation

Before compiling the example program, select the appropriate defconfig for your board (nand/nor/emmc, DDR model, etc.) and build the complete SDK. For details, refer to the alkaid sdk sigdoc document "Development Environment Setup."

  • Compile the board-side nemo example.

        $ cd sdk/verify/opendla
        $ make clean && make source/vad/nemo -j8

  • Final generated executable file location: sdk/verify/opendla/out/${AARCH}/app/prog_vad_nemo

3.2 Running Files

Before running the program, copy the following files to the board:

  • prog_vad_nemo
  • BAC009S0764W0121.wav
  • vad_sim.img

3.3 Running Instructions

  • Usage: ./prog_vad_nemo wav model (command to run the executable)

    • wav: audio file
    • model: offline model
  • Typical Output:

    ./prog_vad_nemo resource/BAC009S0764W0121.wav models/vad_sim.img
    
        client [922] connected, module:ipu
        is_speech score: 0.001206
        is_speech score: 0.002193
        is_speech score: 0.001267
        is_speech score: 0.001172
        is_speech score: 0.001315
        is_speech score: 0.001187
        is_speech score: 0.001588
        is_speech score: 0.001541
        is_speech score: 0.002397
        is_speech score: 0.001925
        is_speech score: 0.004228
        is_speech score: 0.003051
        is_speech score: 0.005933
        is_speech score: 0.004191
        is_speech score: 0.005690
        is_speech score: 0.004748
        is_speech score: 0.009524
        is_speech score: 0.005953
        is_speech score: 0.009928
        is_speech score: 0.017336
        is_speech score: 0.207253
        is_speech score: 0.390963
        is_speech score: 0.838701
        is_speech score: 0.880120
        is_speech score: 0.978352
        is_speech score: 0.991667
        is_speech score: 0.997704
        is_speech score: 0.996953
        is_speech score: 0.999195
        is_speech score: 0.998785
        is_speech score: 0.999110
        is_speech score: 0.999063
        is_speech score: 0.999097
        is_speech score: 0.999155
        is_speech score: 0.999170
        is_speech score: 0.999101
        is_speech score: 0.999304
        is_speech score: 0.999292
        is_speech score: 0.999446
        is_speech score: 0.999426
        is_speech score: 0.999559
        is_speech score: 0.999268
        is_speech score: 0.999304
        is_speech score: 0.998892
        is_speech score: 0.999053
        is_speech score: 0.998227
        is_speech score: 0.998460
        is_speech score: 0.998189
        is_speech score: 0.998425
        is_speech score: 0.998025
        is_speech score: 0.998069
        is_speech score: 0.998468
        is_speech score: 0.999088
        is_speech score: 0.999029
        is_speech score: 0.999137
        is_speech score: 0.999095
        is_speech score: 0.999192
        is_speech score: 0.999230
        is_speech score: 0.999287
        is_speech score: 0.999368
        is_speech score: 0.999313
        is_speech score: 0.999182
        is_speech score: 0.999272
        is_speech score: 0.999301
        is_speech score: 0.999371
        is_speech score: 0.999323
        is_speech score: 0.999371
        is_speech score: 0.999400
        is_speech score: 0.999420
        is_speech score: 0.999357
        is_speech score: 0.999335
        is_speech score: 0.999302
        is_speech score: 0.999152
        is_speech score: 0.998994
        is_speech score: 0.999222
        is_speech score: 0.999228
        is_speech score: 0.999421
        is_speech score: 0.999515
        is_speech score: 0.999520
        is_speech score: 0.999449
        is_speech score: 0.999450
        is_speech score: 0.999400
        is_speech score: 0.999372
        is_speech score: 0.999289
        is_speech score: 0.999198
        is_speech score: 0.999097
        is_speech score: 0.999022
        is_speech score: 0.999040
        is_speech score: 0.998885
        is_speech score: 0.998684
        is_speech score: 0.998773
        is_speech score: 0.998566
        is_speech score: 0.998348
        is_speech score: 0.998506
        is_speech score: 0.998509
        is_speech score: 0.998559
        is_speech score: 0.998362
        is_speech score: 0.998425
        is_speech score: 0.998132
        is_speech score: 0.998449
        is_speech score: 0.997921
        is_speech score: 0.998176
        is_speech score: 0.998372
        is_speech score: 0.998765
        is_speech score: 0.998756
        is_speech score: 0.998616
        is_speech score: 0.998536
        is_speech score: 0.998450
        is_speech score: 0.998342
        is_speech score: 0.998394
        is_speech score: 0.998035
        is_speech score: 0.998153
        is_speech score: 0.998049
        is_speech score: 0.997196
        is_speech score: 0.996593
        is_speech score: 0.996181
        is_speech score: 0.996602
        is_speech score: 0.996609
        is_speech score: 0.996776
        is_speech score: 0.996258
        is_speech score: 0.997197
        is_speech score: 0.996720
        is_speech score: 0.997739
        is_speech score: 0.996624
        is_speech score: 0.997697
        is_speech score: 0.997436
        is_speech score: 0.997792
        is_speech score: 0.997465
        is_speech score: 0.997800
        is_speech score: 0.997511
        is_speech score: 0.998078
        is_speech score: 0.997501
        is_speech score: 0.997395
        is_speech score: 0.996318
        is_speech score: 0.997522
        is_speech score: 0.997541
        is_speech score: 0.998257
        is_speech score: 0.998209
        is_speech score: 0.998666
        is_speech score: 0.998414
        is_speech score: 0.998512
        is_speech score: 0.998298
        is_speech score: 0.998064
        is_speech score: 0.998181
        is_speech score: 0.998808
        is_speech score: 0.998935
        is_speech score: 0.998969
        is_speech score: 0.999039
        is_speech score: 0.999035
        is_speech score: 0.999177
        is_speech score: 0.999238
        is_speech score: 0.999272
        is_speech score: 0.999245
        is_speech score: 0.999046
        is_speech score: 0.998965
        is_speech score: 0.998740
        is_speech score: 0.998614
        is_speech score: 0.998825
        is_speech score: 0.998696
        is_speech score: 0.998248
        is_speech score: 0.996915
        is_speech score: 0.995238
        is_speech score: 0.991321
        is_speech score: 0.992952
        is_speech score: 0.989907
        is_speech score: 0.990689
        is_speech score: 0.987685
        is_speech score: 0.988767
        is_speech score: 0.987314
        is_speech score: 0.982662
        is_speech score: 0.969382
        is_speech score: 0.934737
        is_speech score: 0.890915
        is_speech score: 0.838379
        is_speech score: 0.761765
        is_speech score: 0.678670
        is_speech score: 0.611241
        is_speech score: 0.554069
        is_speech score: 0.497151
        is_speech score: 0.406663
        is_speech score: 0.329079
        is_speech score: 0.229154
        is_speech score: 0.143231
        is_speech score: 0.102593
        is_speech score: 0.054421
        is_speech score: 0.035556
        is_speech score: 0.016550
        is_speech score: 0.010145
        is_speech score: 0.006585
        is_speech score: 0.004256
        is_speech score: 0.003941
        is_speech score: 0.002855
        is_speech score: 0.002925
        is_speech score: 0.002234
        is_speech score: 0.002314
        is_speech score: 0.001892
        is_speech score: 0.002127
        is_speech score: 0.001897
        is_speech score: 0.003490
        is_speech score: 0.002991
        output size: 200 2
        ------shutdown IPU1------
        client [922] disconnected, module:ipu
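
The per-frame `is_speech` scores in the log above can be turned into speech segments by thresholding. The sketch below is one simple way to do that and is not part of the board-side example: the function name `scores_to_segments` is hypothetical, the 0.5 threshold is an arbitrary choice, and the 20 ms frame duration is an assumption that is not stated in this document.

```python
def scores_to_segments(scores, threshold=0.5, frame_seconds=0.02):
    """Group consecutive frames with score >= threshold into (start, end) times.

    frame_seconds=0.02 is an assumed duration per output frame.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # speech begins at this frame
        elif score < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None                   # speech ended before this frame
    if start is not None:
        # Speech runs until the end of the score sequence.
        segments.append((start * frame_seconds, len(scores) * frame_seconds))
    return segments
```

In practice, a higher threshold for entering speech than for leaving it, or a minimum segment length, can reduce spurious toggling around the transition frames where the scores ramp up or down.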