Conformer

1 Overview

1.1 Background Introduction

The Conformer model is a hybrid model that combines the Transformer's self-attention mechanism with CNN convolution modules and is primarily used for speech recognition: it converts input audio into the corresponding text sequence.
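
For intuition, the sketch below shows a single Conformer block in PyTorch (half-step feed-forward, self-attention, convolution module, half-step feed-forward, each with a residual connection). It is only a schematic illustration of the structure described above, not the Wenet implementation, and the dimensions are arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConformerBlock(nn.Module):
        """Schematic Conformer block: FFN (half-step) -> self-attention ->
        convolution module -> FFN (half-step) -> final LayerNorm."""
        def __init__(self, d_model=256, n_heads=4, kernel_size=15, ff_mult=4):
            super().__init__()
            def ffn():
                return nn.Sequential(nn.LayerNorm(d_model),
                                     nn.Linear(d_model, ff_mult * d_model),
                                     nn.SiLU(),
                                     nn.Linear(ff_mult * d_model, d_model))
            self.ff1, self.ff2 = ffn(), ffn()
            self.attn_norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.conv_norm = nn.LayerNorm(d_model)
            self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)                  # pointwise + GLU
            self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size // 2, groups=d_model)  # depthwise
            self.bn = nn.BatchNorm1d(d_model)
            self.pw2 = nn.Conv1d(d_model, d_model, 1)                      # pointwise
            self.out_norm = nn.LayerNorm(d_model)

        def forward(self, x):                          # x: (B, T, d_model)
            x = x + 0.5 * self.ff1(x)
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]
            c = self.conv_norm(x).transpose(1, 2)      # (B, d_model, T) for Conv1d
            c = F.glu(self.pw1(c), dim=1)
            c = self.pw2(F.silu(self.bn(self.dw(c))))
            x = x + c.transpose(1, 2)
            x = x + 0.5 * self.ff2(x)
            return self.out_norm(x)

    # e.g. ConformerBlock()(torch.randn(1, 400, 256)) has shape (1, 400, 256)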

Since the official Conformer implementation does not provide model weights, the models used in this project come from the Wenet framework. The list of open-source Conformer models provided by Wenet can be found in the official Wenet documentation:

https://github.com/wenet-e2e/wenet/blob/v3.0.1/docs/pretrained_models.md

The checkpoint model we use was trained on WenetSpeech and can be downloaded from:

https://wenet.org.cn/downloads?models=wenet&version=wenetspeech_u2pp_conformer_exp.tar.gz

1.2 Usage Instructions

The Linux SDK-alkaid includes pre-converted offline models and board examples by default. The relevant file paths are as follows:

  • Board example program path:

    Linux_SDK/sdk/verify/opendla/source/asr/conformer
    
  • Board offline model path:

    Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/asr/conformer_400x80.img
    
  • Board test audio path:

    Linux_SDK/sdk/verify/opendla/source/resource/BAC009S0764W0121.wav
    
  • Board test dictionary path:

    Linux_SDK/sdk/verify/opendla/source/resource/units_asr_punc_lm.txt
    

If users do not need to convert models, they can directly jump to section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Python environment setup:

    $conda create -n wenet python==3.9
    $conda activate wenet
    $git clone https://github.com/wenet-e2e/wenet.git
    $cd wenet
    $pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
    

    Note: This project was developed against wenet v3.0.1. The Python environment setup above is for reference only; for the exact setup process, please refer to the official source-code tutorial:

    https://github.com/wenet-e2e/wenet/tree/v3.0.1
    
  • Model Testing:

    • Write the model testing script predict.py:

      import wenet

      model = wenet.load_model('chinese')
      # or: model = wenet.load_model(model_dir='xxx')
      result = model.transcribe('audio.wav')
      print(result['text'])

    • Run the model testing script to ensure the Wenet environment is configured correctly:

      $python predict.py
      

    Note: This testing demo comes from the official source-code tutorial. audio.wav needs to be prepared by the user; an audio file from test/resources can be used and placed in the same directory as predict.py.

  • Model Export:

    • Install dependent libraries:

      $pip install onnx -i https://pypi.tuna.tsinghua.edu.cn/simple
      $pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple

    • Run the model conversion script provided by Wenet:

      $python wenet/bin/export_onnx_gpu.py \
          --config opendla/train.yaml \
          --checkpoint opendla/final.pt \
          --cmvn_file opendla/global_cmvn \
          --output_onnx_dir opendla/ \
          --num_decoding_left_chunks -1 \
          --reverse_weight 0.3
      

      Here, opendla refers to the folder containing the model downloaded from the official website; it can be named freely. After a successful export, the log will print:

      INFO:wenet/bin/export_onnx_gpu.py:export to onnx encoder succeed!
      INFO:wenet/bin/export_onnx_gpu.py:export to onnx decoder succeed!

    • Optimize the graph structure:

      $python -m onnxsim opendla/encoder.onnx opendla/conformer_sim.onnx
      

    At this point, the ONNX model has been converted, but it cannot yet be deployed to our platform, as certain operators need to be modified.

  • Model Modification:

    • Modify the original code to change the dynamic-length input into a fixed-length input (the effective speech length is then derived inside the model from the zero-padded input):

      • At line 152 in wenet/transformer/encoder.py, add the following statement:

        xs_lens = torch.tensor([xs[:,:,0].bool().sum()])

      • Modify lines 69-79 in wenet/bin/export_onnx_gpu.py as follows:
        • Original Code

          ctc_log_probs = self.ctc.log_softmax(encoder_out)
          encoder_out_lens = encoder_out_lens.int()
          beam_log_probs, beam_log_probs_idx = torch.topk(ctc_log_probs,
                                                          self.beam_size,
                                                          dim=2)
          return (
              encoder_out,
              encoder_out_lens,
              ctc_log_probs,
              beam_log_probs,
              beam_log_probs_idx,
          )

        • Modified Code

          ctc_log_probs = self.ctc.ctc_lo(encoder_out)
          return (ctc_log_probs)

      • Modify lines 746-827 in wenet/bin/export_onnx_gpu.py as follows:

        • Original Code

          def export_offline_encoder(model, configs, args, logger, encoder_onnx_path):
              bz = 32
              seq_len = 100
              beam_size = args.beam_size
              feature_size = configs["input_dim"]
          
              speech = torch.randn(bz, seq_len, feature_size, dtype=torch.float32)
              speech_lens = torch.randint(low=10,
                                          high=seq_len,
                                          size=(bz, ),
                                          dtype=torch.int32)
              encoder = Encoder(model.encoder, model.ctc, beam_size)
              encoder.eval()
          
              torch.onnx.export(
                  encoder,
                  (speech, speech_lens),
                  encoder_onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["speech", "speech_lengths"],
                  output_names=[
                      "encoder_out",
                      "encoder_out_lens",
                      "ctc_log_probs",
                      "beam_log_probs",
                      "beam_log_probs_idx",
                  ],
                  dynamic_axes={
                      "speech": {
                          0: "B",
                          1: "T"
                      },
                      "speech_lengths": {
                          0: "B"
                      },
                      "encoder_out": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "encoder_out_lens": {
                          0: "B"
                      },
                      "ctc_log_probs": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "beam_log_probs": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "beam_log_probs_idx": {
                          0: "B",
                          1: "T_OUT"
                      },
                  },
                  verbose=False,
              )
          
              with torch.no_grad():
                  o0, o1, o2, o3, o4 = encoder(speech, speech_lens)
          
              providers = ["CUDAExecutionProvider"]
              ort_session = onnxruntime.InferenceSession(encoder_onnx_path,
                                                      providers=providers)
              ort_inputs = {
                  "speech": to_numpy(speech),
                  "speech_lengths": to_numpy(speech_lens),
              }
          
              ort_outs = ort_session.run(None, ort_inputs)
          
              # check encoder output
              test(to_numpy([o0,o1,o2,o3,o4]), ort_outs)
              logger.info("export offline onnx encoder succeed!")
              onnx_config = {
                  "beam_size": args.beam_size,
                  "reverse_weight": args.reverse_weight,
                  "ctc_weight": args.ctc_weight,
                  "fp16": args.fp16,
              }
              return onnx_config
          
        • Modified Code

          def export_offline_encoder(model, configs, args, logger, encoder_onnx_path):
              bz = 1
              seq_len = 400
              beam_size = args.beam_size
              feature_size = configs["input_dim"]
          
              speech = torch.randn(bz, seq_len, feature_size, dtype=torch.float32)
              speech_lens = torch.randint(low=10,
                                          high=seq_len,
                                          size=(bz, ),
                                          dtype=torch.int32)
              encoder = Encoder(model.encoder, model.ctc, beam_size)
              encoder.eval()
          
              torch.onnx.export(
                  encoder,
                  (speech, speech_lens),
                  encoder_onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["speech", "speech_lengths"],
                  output_names=["probs"],
                  verbose=False,
              )
          
              with torch.no_grad():
                  o0 = F.log_softmax(encoder(speech, speech_lens), dim=2)
          
              providers = ["CPUExecutionProvider"]
              ort_session = onnxruntime.InferenceSession(encoder_onnx_path,
                                                      providers=providers)
              ort_inputs = {
                  "speech": to_numpy(speech),
                  # "speech_lengths": to_numpy(speech_lens),
              }
          
              ort_outs = F.log_softmax(torch.tensor(ort_session.run(None, ort_inputs)[0]), dim=2)
          
              # check encoder output
              test(o0, ort_outs)
              logger.info("export offline onnx encoder succeed!")
              onnx_config = {
                  "beam_size": args.beam_size,
                  "reverse_weight": args.reverse_weight,
                  "ctc_weight": args.ctc_weight,
                  "fp16": args.fp16,
              }
              return onnx_config
          

    After completing the modifications, execute the model export and graph-optimization steps again to generate the deployable conformer_sim.onnx model.

2.2 Offline Model Conversion

2.2.1 Pre & Post-Processing Instructions

  • Pre-processing

    Before feeding audio into the model, the speech wav needs to be converted into fbank features. The converted conformer_sim.onnx model takes an fbank input of shape (1, 400, 80), where 400 is the time-sequence length and 80 is the number of mel filterbank channels.

  • Post-processing

    The output of the converted conformer_sim.onnx model has shape (1, 99, 5538), where 99 is the output sequence length and 5538 is the number of classes. After obtaining the output features, apply log_softmax and then decode them with greedy search to convert them into a text sequence. A minimal end-to-end sketch of this pre- and post-processing is given after this list.
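
The sketch below shows, under stated assumptions, how this pre- and post-processing fits together around the exported ONNX model: Kaldi-style fbank extraction zero-padded to 400 frames, log_softmax on the probs output, and CTC greedy-search decoding with the Wenet dictionary. The fbank parameters (80 mel bins, 25 ms frame length, 10 ms shift, 16 kHz audio) and the blank index 0 are assumptions based on common Wenet defaults; take the exact values from the train.yaml of the downloaded checkpoint.

    # pre_post_sketch.py - a minimal sketch, not the board implementation
    import numpy as np
    import torch
    import torchaudio
    import torchaudio.compliance.kaldi as kaldi
    import onnxruntime

    FIXED_FRAMES = 400   # fixed input length expected by the converted model

    def wav_to_fbank(wav_path):
        waveform, sr = torchaudio.load(wav_path)           # (1, num_samples)
        waveform = waveform * (1 << 15)                     # Kaldi-style int16 scaling
        feats = kaldi.fbank(waveform,
                            num_mel_bins=80,
                            frame_length=25,
                            frame_shift=10,
                            dither=0.0,
                            energy_floor=0.0,
                            sample_frequency=sr)            # (T, 80)
        # Zero-pad (or truncate) to the fixed 400-frame input length.
        padded = torch.zeros(FIXED_FRAMES, 80)
        n = min(feats.shape[0], FIXED_FRAMES)
        padded[:n] = feats[:n]
        return padded.unsqueeze(0).numpy()                  # (1, 400, 80), float32

    def greedy_ctc_decode(probs, dict_path, blank_id=0):
        # probs: (1, 99, 5538) raw logits from the "probs" output node.
        log_probs = torch.log_softmax(torch.from_numpy(probs), dim=-1)
        ids = log_probs.argmax(dim=-1)[0].tolist()
        # Wenet dictionary: one "token index" pair per line, ordered by index.
        with open(dict_path, encoding="utf-8") as f:
            vocab = [line.split()[0] for line in f]
        # Standard CTC greedy search: collapse repeats, drop blanks.
        text, prev = [], blank_id
        for i in ids:
            if i != blank_id and i != prev:
                text.append(vocab[i])
            prev = i
        return "".join(text)

    if __name__ == "__main__":
        sess = onnxruntime.InferenceSession("opendla/conformer_sim.onnx",
                                            providers=["CPUExecutionProvider"])
        speech = wav_to_fbank("BAC009S0764W0121.wav")
        probs = sess.run(None, {"speech": speech})[0]
        print(greedy_ctc_decode(probs, "units_asr_punc_lm.txt"))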

2.2.2 Offline Model Conversion Process

Note: 1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar. 2) The conversion command needs to be run in a Docker environment; please load the SGS Docker environment according to the Docker Development Environment Tutorial.

  • Copy the ONNX model to the conversion code directory:

    $cp opendla/conformer_sim.onnx OpenDLAModel/asr/conformer/onnx
    
  • Conversion command:

    $cd IPU_SDK_Release/docker
    $bash run_docker.sh
    # Enter the OpenDLAModel directory in the Docker environment
    $cd /work/SGS_XXX/OpenDLAModel
    $bash convert.sh -a asr/conformer -c config/asr_conformer.cfg -p SGS_IPU_Toolchain(absolute path) -s false
    
  • Final generated model locations:

    output/${chip}_${time}/conformer_sim.img
    
    output/${chip}_${time}/conformer_sim_fixed.sim
    
    output/${chip}_${time}/conformer_sim_float.img
    

2.2.3 Key Script Parameter Analysis

  • input_config.ini

        [INPUT_CONFIG]
        inputs=speech;                      # ONNX input node name; separate multiple names with commas if necessary;
        input_formats=RAWDATA_F32_NHWC;     # Board input format; choose according to the ONNX input type, e.g., float32: RAWDATA_F32_NHWC, int16: RAWDATA_S16_NHWC;
        quantizations=TRUE;                 # Enable input quantization; do not modify;
        [OUTPUT_CONFIG]
        outputs=probs;                      # ONNX output node name; separate multiple names with commas if necessary;
        dequantizations=TRUE;               # Whether to enable dequantization; set according to actual needs (TRUE is recommended): FALSE gives int16 output, TRUE gives float32 output;
        [OPTIMIZE_CONFIG]
        optimize_layernorm_precision=TRUE;
    
  • asr_conformer.cfg

        [COMFORMER]
        CHIP_LIST=pcupid                    # Platform name; must match the board platform, otherwise the model cannot run
        Model_LIST=conformer_sim            # Input ONNX model name
        INPUT_SIZE_LIST=0                   # Model input resolution
        INPUT_INI_LIST=input_config.ini     # Configuration file
        CLASS_NUM_LIST=0                    # Just fill in 0
        SAVE_NAME_LIST=conformer_sim.img    # Output model name; can be modified
        QUANT_DATA_PATH=image_list.txt      # Quantization data path
    

2.3 Model Simulation

  • Obtain float/fixed/offline model output:

    $bash convert.sh -a asr/conformer -c configs/asr_conformer.cfg -p SGS_IPU_Toolchain(absolute path) -s true
    

    After executing the above command, the float model's output tensors are saved by default to txt files under asr/conformer/log/output. In addition, the asr/conformer/convert.sh script provides simulation examples for the fixed and offline models; uncomment the corresponding code blocks to obtain their outputs.

  • Model Accuracy Comparison

    With the same input as above, enter the environment set up in section 2.1 and add the following print statement to the export_offline_encoder function in wenet/bin/export_onnx_gpu.py:

    print(encoder(speech, speech_lens))
    

    This prints the output tensor of the PyTorch model node, which you can then compare against the float, fixed, and offline model outputs. Note that the original model's output format is NCHW, while the float/fixed/offline model outputs are NHWC; a rough comparison sketch is given below.
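
    As a rough illustration, the sketch below (file names, plain-text dump layout, and exact shapes are assumptions) compares such a PyTorch reference dump against a float/fixed/offline simulation dump; the essential step is transposing the NHWC simulator output back to the reference layout before measuring the error.

    import numpy as np

    def compare(ref, sim):
        ref, sim = ref.reshape(-1), sim.reshape(-1)
        cos = np.dot(ref, sim) / (np.linalg.norm(ref) * np.linalg.norm(sim) + 1e-12)
        print("cosine similarity:", float(cos))
        print("max abs diff     :", float(np.abs(ref - sim).max()))

    # Hypothetical dump files: the tensor printed by print(encoder(speech, speech_lens))
    # saved as text, and a float-model output txt from asr/conformer/log/output.
    ref = np.loadtxt("pytorch_probs.txt").reshape(1, 5538, 99)   # assumed channel-first layout
    sim = np.loadtxt("float_probs.txt").reshape(1, 99, 5538)     # assumed channel-last layout

    # Bring the simulator output back to the reference layout before comparing.
    compare(ref, np.transpose(sim, (0, 2, 1)))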

3 Board Deployment

3.1 Program Compilation

Before compiling the example program for the board, you need to select the defconfig according to the board model (nand/nor/emmc, DDR model, etc.) and perform the SDK full-package compilation. For details, please refer to the Alkaid SDK SIGDOC "Development Environment Setup" document.

  • Compile the board conformer example:

    $cd sdk/verify/opendla
    $make clean && make source/asr/conformer -j8
    
  • Final executable file address:

    sdk/verify/opendla/out/${AARCH}/app/prog_asr_conformer
    

3.2 Running Files

When running the program, the following files need to be copied to the board:

  • prog_asr_conformer

  • BAC009S0764W0121.wav

  • units_asr_punc_lm.txt

  • conformer_400x80.img

3.3 Running Instructions

  • Usage: ./prog_asr_conformer -i wav -m model -d txt (command for running the program)

  • Required Input:

    • wav: Audio path

    • model: Path to the offline model to be tested

    • txt: Dictionary

  • Typical Output:

    ./prog_asr_conformer -i resource/BAC009S0764W0121.wav -m models/conformer_400x80.img -d resource/units_asr_punc_lm.txt
        input path: resource/BAC009S0764W0121.wav
        model path: models/conformer_400x80.img
        dict path: resource/units_client [907] connected, module:ipu
        asr_punc_lm.txt
        num_frames: 418, sizeof(input_buf): 128000
        model invoke time: 330.710000 ms
        load dict...
        vocabulary size: 5538
        decode result...
        ... (Further processing log output)
        ------shutdown IPU0------
        client [907] disconnected, module:ipu