Conformer

1 Overview

1.1 Background Introduction

The Conformer model is a hybrid model that combines the Transformer's self-attention mechanism with CNN convolution modules and is primarily used for speech recognition: it converts input audio into the corresponding text sequence.
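
For intuition, the sketch below shows a single Conformer block in PyTorch (half-step feed-forward, self-attention, convolution module, half-step feed-forward, each with a residual connection). It is only a schematic illustration of the structure described above, not the Wenet implementation, and the dimensions are arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConformerBlock(nn.Module):
        """Schematic Conformer block: FFN (half-step) -> self-attention ->
        convolution module -> FFN (half-step) -> final LayerNorm."""
        def __init__(self, d_model=256, n_heads=4, kernel_size=15, ff_mult=4):
            super().__init__()
            def ffn():
                return nn.Sequential(nn.LayerNorm(d_model),
                                     nn.Linear(d_model, ff_mult * d_model),
                                     nn.SiLU(),
                                     nn.Linear(ff_mult * d_model, d_model))
            self.ff1, self.ff2 = ffn(), ffn()
            self.attn_norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.conv_norm = nn.LayerNorm(d_model)
            self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)                  # pointwise + GLU
            self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                                padding=kernel_size // 2, groups=d_model)  # depthwise
            self.bn = nn.BatchNorm1d(d_model)
            self.pw2 = nn.Conv1d(d_model, d_model, 1)                      # pointwise
            self.out_norm = nn.LayerNorm(d_model)

        def forward(self, x):                          # x: (B, T, d_model)
            x = x + 0.5 * self.ff1(x)
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]
            c = self.conv_norm(x).transpose(1, 2)      # (B, d_model, T) for Conv1d
            c = F.glu(self.pw1(c), dim=1)
            c = self.pw2(F.silu(self.bn(self.dw(c))))
            x = x + c.transpose(1, 2)
            x = x + 0.5 * self.ff2(x)
            return self.out_norm(x)

    # e.g. ConformerBlock()(torch.randn(1, 400, 256)) has shape (1, 400, 256)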

Since the official Conformer implementation does not provide model weights, the models used in this project come from the Wenet framework. The list of open-source Conformer models provided by Wenet can be found in the official Wenet documentation:

https://github.com/wenet-e2e/wenet/blob/v3.0.1/docs/pretrained_models.md

The checkpoint model we use was trained on WenetSpeech and can be downloaded from:

https://wenet.org.cn/downloads?models=wenet&version=wenetspeech_u2pp_conformer_exp.tar.gz

1.2 Usage Instructions

The Linux SDK-alkaid includes pre-converted offline models and board examples by default. The relevant file paths are as follows:

  • Board example program path:

    Linux_SDK/sdk/verify/opendla/source/asr/conformer
    
  • Board offline model path:

    Linux_SDK/project/board/${chip}/dla_file/ipu_open_models/asr/conformer_400x80.img
    
  • Board test audio path:

    Linux_SDK/sdk/verify/opendla/source/resource/BAC009S0764W0121.wav
    
  • Board test dictionary path:

    Linux_SDK/sdk/verify/opendla/source/resource/units_asr_punc_lm.txt
    

If users do not need to convert models, they can directly jump to section 3.

2 Model Conversion

2.1 ONNX Model Conversion

  • Python environment setup:

    $conda create -n wenet python==3.9
    $conda activate wenet
    $git clone https://github.com/wenet-e2e/wenet.git
    $cd wenet
    $pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
    

    Note: This project was developed against wenet v3.0.1. The Python environment setup above is for reference only; for the exact setup process, please refer to the official source-code tutorial:

    https://github.com/wenet-e2e/wenet/tree/v3.0.1
    
  • Model Testing:

    • Write the model testing script predict.py:

      import wenet

      model = wenet.load_model('chinese')
      # or: model = wenet.load_model(model_dir='xxx')
      result = model.transcribe('audio.wav')
      print(result['text'])

    • Run the model testing script to ensure the Wenet environment is configured correctly:

      $python predict.py
      

    Note: This testing demo comes from the official source-code tutorial. audio.wav needs to be prepared by the user; an audio file from test/resources can be used and placed in the same directory as predict.py.

  • Model Export:

    • Install dependent libraries:

      $pip install onnx -i https://pypi.tuna.tsinghua.edu.cn/simple
      $pip install onnx-simplifier -i https://pypi.tuna.tsinghua.edu.cn/simple

    • Run the model conversion script provided by Wenet:

      $python wenet/bin/export_onnx_gpu.py \
          --config opendla/train.yaml \
          --checkpoint opendla/final.pt \
          --cmvn_file opendla/global_cmvn \
          --output_onnx_dir opendla/ \
          --num_decoding_left_chunks -1 \
          --reverse_weight 0.3
      

      Here, opendla refers to the folder containing the model downloaded from the official website; it can be named freely. After a successful export, the log will print:

      INFO:wenet/bin/export_onnx_gpu.py:export to onnx encoder succeed!
      INFO:wenet/bin/export_onnx_gpu.py:export to onnx decoder succeed!

    • Optimize the graph structure:

      $python -m onnxsim opendla/encoder.onnx opendla/conformer_sim.onnx
      

    At this point, the ONNX model has been converted, but it cannot yet be deployed to our platform, as certain operators need to be modified.

  • Model Modification:

    • Modify the original code to change the dynamic-length input into a fixed-length input (the effective speech length is then derived inside the model from the zero-padded input):

      • At line 152 in wenet/transformer/encoder.py, add the following statement:

        xs_lens = torch.tensor([xs[:,:,0].bool().sum()])

      • Modify lines 69-79 in wenet/bin/export_onnx_gpu.py as follows:
        • Original Code

          ctc_log_probs = self.ctc.log_softmax(encoder_out)
          encoder_out_lens = encoder_out_lens.int()
          beam_log_probs, beam_log_probs_idx = torch.topk(ctc_log_probs,
                                                          self.beam_size,
                                                          dim=2)
          return (
              encoder_out,
              encoder_out_lens,
              ctc_log_probs,
              beam_log_probs,
              beam_log_probs_idx,
          )

        • Modified Code

          ctc_log_probs = self.ctc.ctc_lo(encoder_out)
          return (ctc_log_probs)

      • Modify lines 746-827 in wenet/bin/export_onnx_gpu.py as follows:

        • Original Code

          def export_offline_encoder(model, configs, args, logger, encoder_onnx_path):
              bz = 32
              seq_len = 100
              beam_size = args.beam_size
              feature_size = configs["input_dim"]
          
              speech = torch.randn(bz, seq_len, feature_size, dtype=torch.float32)
              speech_lens = torch.randint(low=10,
                                          high=seq_len,
                                          size=(bz, ),
                                          dtype=torch.int32)
              encoder = Encoder(model.encoder, model.ctc, beam_size)
              encoder.eval()
          
              torch.onnx.export(
                  encoder,
                  (speech, speech_lens),
                  encoder_onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["speech", "speech_lengths"],
                  output_names=[
                      "encoder_out",
                      "encoder_out_lens",
                      "ctc_log_probs",
                      "beam_log_probs",
                      "beam_log_probs_idx",
                  ],
                  dynamic_axes={
                      "speech": {
                          0: "B",
                          1: "T"
                      },
                      "speech_lengths": {
                          0: "B"
                      },
                      "encoder_out": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "encoder_out_lens": {
                          0: "B"
                      },
                      "ctc_log_probs": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "beam_log_probs": {
                          0: "B",
                          1: "T_OUT"
                      },
                      "beam_log_probs_idx": {
                          0: "B",
                          1: "T_OUT"
                      },
                  },
                  verbose=False,
              )
          
              with torch.no_grad():
                  o0, o1, o2, o3, o4 = encoder(speech, speech_lens)
          
              providers = ["CUDAExecutionProvider"]
              ort_session = onnxruntime.InferenceSession(encoder_onnx_path,
                                                      providers=providers)
              ort_inputs = {
                  "speech": to_numpy(speech),
                  "speech_lengths": to_numpy(speech_lens),
              }
          
              ort_outs = ort_session.run(None, ort_inputs)
          
              # check encoder output
              test(to_numpy([o0,o1,o2,o3,o4]), ort_outs)
              logger.info("export offline onnx encoder succeed!")
              onnx_config = {
                  "beam_size": args.beam_size,
                  "reverse_weight": args.reverse_weight,
                  "ctc_weight": args.ctc_weight,
                  "fp16": args.fp16,
              }
              return onnx_config
          
        • Modified Code

          def export_offline_encoder(model, configs, args, logger, encoder_onnx_path):
              bz = 1
              seq_len = 400
              beam_size = args.beam_size
              feature_size = configs["input_dim"]
          
              speech = torch.randn(bz, seq_len, feature_size, dtype=torch.float32)
              speech_lens = torch.randint(low=10,
                                          high=seq_len,
                                          size=(bz, ),
                                          dtype=torch.int32)
              encoder = Encoder(model.encoder, model.ctc, beam_size)
              encoder.eval()
          
              torch.onnx.export(
                  encoder,
                  (speech, speech_lens),
                  encoder_onnx_path,
                  export_params=True,
                  opset_version=13,
                  do_constant_folding=True,
                  input_names=["speech", "speech_lengths"],
                  output_names=["probs"],
                  verbose=False,
              )
          
              with torch.no_grad():
                  o0 = F.log_softmax(encoder(speech, speech_lens), dim=2)
          
              providers = ["CPUExecutionProvider"]
              ort_session = onnxruntime.InferenceSession(encoder_onnx_path,
                                                      providers=providers)
              ort_inputs = {
                  "speech": to_numpy(speech),
                  # "speech_lengths": to_numpy(speech_lens),
              }
          
              ort_outs = F.log_softmax(torch.tensor(ort_session.run(None, ort_inputs)[0]), dim=2)
          
              # check encoder output
              test(o0, ort_outs)
              logger.info("export offline onnx encoder succeed!")
              onnx_config = {
                  "beam_size": args.beam_size,
                  "reverse_weight": args.reverse_weight,
                  "ctc_weight": args.ctc_weight,
                  "fp16": args.fp16,
              }
              return onnx_config
          

    After completing the modifications, execute the model export and graph-optimization steps again to generate the deployable conformer_sim.onnx model.

2.2 Offline Model Conversion

2.2.1 Pre & Post-Processing Instructions

  • Pre-processing

    Before feeding audio into the model, the speech wav needs to be converted into fbank features. The converted conformer_sim.onnx model takes an fbank input of shape (1, 400, 80), where 400 is the time-sequence length and 80 is the number of mel filterbank channels.

  • Post-processing

    The output of the converted conformer_sim.onnx model has shape (1, 99, 5538), where 99 is the output sequence length and 5538 is the number of classes. After obtaining the output features, apply log_softmax and then decode them with greedy search to convert them into a text sequence. A minimal end-to-end sketch of this pre- and post-processing is given after this list.
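
The sketch below shows, under stated assumptions, how this pre- and post-processing fits together around the exported ONNX model: Kaldi-style fbank extraction zero-padded to 400 frames, log_softmax on the probs output, and CTC greedy-search decoding with the Wenet dictionary. The fbank parameters (80 mel bins, 25 ms frame length, 10 ms shift, 16 kHz audio) and the blank index 0 are assumptions based on common Wenet defaults; take the exact values from the train.yaml of the downloaded checkpoint.

    # pre_post_sketch.py - a minimal sketch, not the board implementation
    import numpy as np
    import torch
    import torchaudio
    import torchaudio.compliance.kaldi as kaldi
    import onnxruntime

    FIXED_FRAMES = 400   # fixed input length expected by the converted model

    def wav_to_fbank(wav_path):
        waveform, sr = torchaudio.load(wav_path)           # (1, num_samples)
        waveform = waveform * (1 << 15)                     # Kaldi-style int16 scaling
        feats = kaldi.fbank(waveform,
                            num_mel_bins=80,
                            frame_length=25,
                            frame_shift=10,
                            dither=0.0,
                            energy_floor=0.0,
                            sample_frequency=sr)            # (T, 80)
        # Zero-pad (or truncate) to the fixed 400-frame input length.
        padded = torch.zeros(FIXED_FRAMES, 80)
        n = min(feats.shape[0], FIXED_FRAMES)
        padded[:n] = feats[:n]
        return padded.unsqueeze(0).numpy()                  # (1, 400, 80), float32

    def greedy_ctc_decode(probs, dict_path, blank_id=0):
        # probs: (1, 99, 5538) raw logits from the "probs" output node.
        log_probs = torch.log_softmax(torch.from_numpy(probs), dim=-1)
        ids = log_probs.argmax(dim=-1)[0].tolist()
        # Wenet dictionary: one "token index" pair per line, ordered by index.
        with open(dict_path, encoding="utf-8") as f:
            vocab = [line.split()[0] for line in f]
        # Standard CTC greedy search: collapse repeats, drop blanks.
        text, prev = [], blank_id
        for i in ids:
            if i != blank_id and i != prev:
                text.append(vocab[i])
            prev = i
        return "".join(text)

    if __name__ == "__main__":
        sess = onnxruntime.InferenceSession("opendla/conformer_sim.onnx",
                                            providers=["CPUExecutionProvider"])
        speech = wav_to_fbank("BAC009S0764W0121.wav")
        probs = sess.run(None, {"speech": speech})[0]
        print(greedy_ctc_decode(probs, "units_asr_punc_lm.txt"))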

2.2.2 Offline Model Conversion Process

Note: 1) OpenDLAModel corresponds to the smodel files extracted from the compressed package image-dev_model_convert.tar. 2) The conversion command needs to be run in a Docker environment; please load the SGS Docker environment according to the Docker Development Environment Tutorial.

  • Copy the ONNX model to the conversion code directory:

    $cp opendla/conformer_sim.onnx OpenDLAModel/asr/conformer/onnx
    
  • Conversion command:

    $cd IPU_SDK_Release/docker
    $bash run_docker.sh
    # Enter the OpenDLAModel directory in the Docker environment
    $cd /work/SGS_XXX/OpenDLAModel
    $bash convert.sh -a asr/conformer -c config/asr_conformer.cfg -p SGS_IPU_Toolchain(absolute path) -s false
    
  • Final generated model locations:

    output/${chip}_${time}/conformer_sim.img
    
    output/${chip}_${time}/conformer_sim_fixed.sim
    
    output/${chip}_${time}/conformer_sim_float.img
    

2.2.3 Key Script Parameter Analysis

  • input_config.ini

        [INPUT_CONFIG]
        inputs=speech;                      # ONNX input node name; separate multiple names with commas if necessary;
        input_formats=RAWDATA_F32_NHWC;     # Board input format; choose according to the ONNX input type, e.g., float32: RAWDATA_F32_NHWC, int16: RAWDATA_S16_NHWC;
        quantizations=TRUE;                 # Enable input quantization; do not modify;
        [OUTPUT_CONFIG]
        outputs=probs;                      # ONNX output node name; separate multiple names with commas if necessary;
        dequantizations=TRUE;               # Whether to enable dequantization; set according to actual needs (TRUE is recommended): FALSE gives int16 output, TRUE gives float32 output;
        [OPTIMIZE_CONFIG]
        optimize_layernorm_precision=TRUE;
    
  • asr_conformer.cfg

        [COMFORMER]
        CHIP_LIST=pcupid                    # Platform name; must match the board platform, otherwise the model cannot run
        Model_LIST=conformer_sim            # Input ONNX model name
        INPUT_SIZE_LIST=0                   # Model input resolution
        INPUT_INI_LIST=input_config.ini     # Configuration file
        CLASS_NUM_LIST=0                    # Just fill in 0
        SAVE_NAME_LIST=conformer_sim.img    # Output model name; can be modified
        QUANT_DATA_PATH=image_list.txt      # Quantization data path
    

2.3 Model Simulation

  • Obtain float/fixed/offline model output:

    $bash convert.sh -a asr/conformer -c configs/asr_conformer.cfg -p SGS_IPU_Toolchain(absolute path) -s true
    

    After executing the above command, the float model's output tensors are saved by default to txt files under asr/conformer/log/output. In addition, the asr/conformer/convert.sh script provides simulation examples for the fixed and offline models; uncomment the corresponding code blocks to obtain their outputs.

  • Model Accuracy Comparison

    With the same input as above, enter the environment set up in section 2.1 and add the following print statement to the export_offline_encoder function in wenet/bin/export_onnx_gpu.py:

    print(encoder(speech, speech_lens))
    

    This prints the output tensor of the PyTorch model node, which you can then compare against the float, fixed, and offline model outputs. Note that the original model's output format is NCHW, while the float/fixed/offline model outputs are NHWC; a rough comparison sketch is given below.
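
    As a rough illustration, the sketch below (file names, plain-text dump layout, and exact shapes are assumptions) compares such a PyTorch reference dump against a float/fixed/offline simulation dump; the essential step is transposing the NHWC simulator output back to the reference layout before measuring the error.

    import numpy as np

    def compare(ref, sim):
        ref, sim = ref.reshape(-1), sim.reshape(-1)
        cos = np.dot(ref, sim) / (np.linalg.norm(ref) * np.linalg.norm(sim) + 1e-12)
        print("cosine similarity:", float(cos))
        print("max abs diff     :", float(np.abs(ref - sim).max()))

    # Hypothetical dump files: the tensor printed by print(encoder(speech, speech_lens))
    # saved as text, and a float-model output txt from asr/conformer/log/output.
    ref = np.loadtxt("pytorch_probs.txt").reshape(1, 5538, 99)   # assumed channel-first layout
    sim = np.loadtxt("float_probs.txt").reshape(1, 99, 5538)     # assumed channel-last layout

    # Bring the simulator output back to the reference layout before comparing.
    compare(ref, np.transpose(sim, (0, 2, 1)))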

3 Board Deployment

3.1 Program Compilation

Before compiling the example program for the board, you need to select the defconfig according to the board model (nand/nor/emmc, DDR model, etc.) and perform the SDK full-package compilation. For details, please refer to the Alkaid SDK SIGDOC "Development Environment Setup" document.

  • Compile the board conformer example:

    $cd sdk/verify/opendla
    $make clean && make source/asr/conformer -j8
    
  • Final executable file address:

    sdk/verify/opendla/out/${AARCH}/app/prog_asr_conformer
    

3.2 Running Files

When running the program, the following files need to be copied to the board:

  • prog_asr_conformer

  • BAC009S0764W0121.wav

  • units_asr_punc_lm.txt

  • conformer_400x80.img

3.3 Running Instructions

  • Usage: ./prog_asr_conformer -i wav -m model -d txt (command for running the program)

  • Required Input:

    • wav: Audio path

    • model: Path to the offline model to be tested

    • txt: Dictionary

  • Typical Output:

    ./prog_asr_conformer -i resource/BAC009S0764W0121.wav -m models/conformer_400x80.img -d resource/units_asr_punc_lm.txt
        input path: resource/BAC009S0764W0121.wav
        model path: models/conformer_400x80.img
        dict path: resource/units_client [907] connected, module:ipu
        asr_punc_lm.txt
        num_frames: 418, sizeof(input_buf): 128000
        model invoke time: 330.710000 ms
        load dict...
        vocabulary size: 5538
        decode result...
        ... (Further processing log output)
        ------shutdown IPU0------
        client [907] disconnected, module:ipu