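4. Debugging Model Accuracy Problems

1. Overview of IPU accuracy problems

When model accuracy does not meet expectations, the IPU toolchain offers debugging approaches for the three most common kinds of accuracy problem:

Accuracy differences between the original framework model and the Float.sim model
Accuracy differences between the Float.sim model and the Fixed.sim model
Accuracy problems on the board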

DebugFlow

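2. Accuracy differences between the original framework model and the Float.sim model

When model accuracy does not meet expectations, first check whether the inference results of the original framework model and the Float.sim model agree. Investigating quantization issues is only meaningful once the floating-point results are confirmed to match.

2.1 How to run the original AI framework model

Use SGS_IPU_Toolchain/DumpDebug/code/run_caffe.py to run a Caffe model.

Example: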

python3 run_caffe.py \
-i 000775.jpg \
--model ./caffe_mobilenet_v2.prototxt \
--weight ./caffe_mobilenet_v2.caffemodel \
-n ./caffe_mobilenet_v2.py \
--input_config ./input_config.ini

Use SGS_IPU_Toolchain/DumpDebug/code/run_onnx.py to run an ONNX model.

Example:

python3 run_onnx.py \
-i 000775.jpg \
--model ./mobilenet_v2.onnx \
-n ./mobilenet_v2.py \
--input_config ./input_config.ini

Use SGS_IPU_Toolchain/DumpDebug/code/run_tflite.py to run a TFLite model.

Example:

python3 run_tflite.py \
-i 000775.jpg \
--model ./mobilenet_v2.tflite \
-n ./mobilenet_v2.py

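Compare the inference result of the original AI framework model with that of Float.sim. If they differ, dump the data layer by layer and compare to find the first operator at which the results diverge.

Notes

First make sure that the preprocessing of the original model and of the Float.sim model produce the same result.
The original model's inference results are saved in the output directory, and the Float.sim model's results are saved in the log/output directory. Rename the generated files after every inference run so they are not overwritten.

2.2 How to dump inference data from the original AI framework model

2.2.1 Dumping an original Caffe model

The tool is located at SGS_IPU_Toolchain/DumpDebug/code/caffe_dump_data.py.
caffe_dump_data.py must be run with python3 (the IPU Toolchain environment already includes the Caffe Python runtime). It dumps the data of every layer of the original Caffe model, either as text or in binary form.

Example: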

python3 caffe_dump_data.py \
--model_file caffe_mobilenet_v2.prototxt \
--weight_file caffe_mobilenet_v2.caffemodel \
--image ./img.bmp \
--dump_bin True \
-n ./caffe_mobilenet_v2.py

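Parameters:

① Required parameters:

--model_file: Path to the prototxt file of the original Caffe model.
--weight_file: Path to the caffemodel file of the original Caffe model.
-i, --image: Path to an image file, an image folder, or a file listing image paths.
--dump_bin: Whether to dump each layer's result in binary form; True or False.

Notes

True: The dump is saved in binary form under the current working directory. A dumpData folder is created automatically and contains the result file ./dumpData/caffe_NHWC_outtensor_dump.bin (4-D tensors are laid out as NHWC, the same as the IPU model).
False: The dump is saved as text under the current working directory. A dumpData folder is created automatically and contains the result file ./dumpData/caffe_NHWC_outtensor_dump.txt (4-D tensors laid out as NHWC, the same as the IPU model), the folder ./dumpData/NHWC (one file per layer output, 4-D tensors laid out as NHWC, the same as the IPU model), and the folder ./dumpData/NCHW (one file per layer output, 4-D tensors laid out as NCHW, the same as the original model).
If you need to analyze the data with the dump_debug.py script, you must use True.

-n, --preprocess: Preprocessing method. Give the path to the preprocessing Python file directly, and use the same preprocessing file that is used to run the sim model.

② Optional parameters:

--input_config: Path to input_config.ini. If the model is configured with input_layouts=NCHW, pass input_config.ini so that caffe_dump_data.py can internally align the data layout produced by -n, --preprocess.

Notes

For a multi-input model, -n/--preprocess takes several preprocessing methods, for example -n preprocess1.py,preprocess2.py or --preprocess preprocess1.py,preprocess2.py.
For a multi-input model, pass -i/--image as a file listing image paths.

2.2.2 Dumping an original ONNX model

The tool is located at SGS_IPU_Toolchain/DumpDebug/code/onnx_dump_data.py.
onnx_dump_data.py must be run with python3 (the same environment as the IPU Toolchain). It dumps the data of every layer of the original ONNX model, either as text or in binary form.

Example: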

python3 onnx_dump_data.py \
--model_file onnx_mobilenet_v2.onnx \
--image ./img.bmp \
--dump_bin True \
-n onnx_mobilenet_v2.py

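Parameters:

① Required parameters:

--model_file: Path to the ONNX refine model file (generated when ConvertTool.py is run), or to an ONNX model in which every layer carries shape information.

-i, --image: Path to an image file, an image folder, or a file listing image paths.

--dump_bin: Whether to dump each layer's result in binary form; True or False.

Notes

True: The dump is saved in binary form under the current working directory. A dumpData folder is created automatically and contains the result file ./dumpData/onnx_NHWC_outtensor_dump.bin (4-D tensors are laid out as NHWC, the same as the IPU model).
False: The dump is saved as text under the current working directory. A dumpData folder is created automatically and contains the result file ./dumpData/onnx_NHWC_outtensor_dump.txt (4-D tensors laid out as NHWC, the same as the IPU model), the folder ./dumpData/NHWC (one file per layer output, 4-D tensors laid out as NHWC, the same as the IPU model), and the folder ./dumpData/NCHW (one file per layer output, 4-D tensors laid out as NCHW, the same as the original model).
If you need to analyze the data with the dump_debug.py script, you must use True.

-n, --preprocess: Preprocessing method. Give the path to the preprocessing Python file directly, and use the same preprocessing file that is used to run the sim model.

② Optional parameters:

--input_config: Path to input_config.ini. If the model is configured with input_layouts=NCHW, pass input_config.ini so that onnx_dump_data.py can internally align the data layout produced by -n, --preprocess.

Notes

For a multi-input model, -n/--preprocess takes several preprocessing methods, for example -n preprocess1.py,preprocess2.py or --preprocess preprocess1.py,preprocess2.py.
For a multi-input model, pass -i/--image as a file listing image paths.

2.3 How to dump inference data from the Float.sim and Fixed.sim models

Before dumping the float and fixed models layer by layer, copy SGS_IPU_Toolchain/cfg/DebugConfig.txt to the execution directory and enable the switches shown below. Then use SGS_IPU_Toolchain/Scripts/calibrator/simulator.py to simulate the float model and the fixed model separately; a sigma_outtensor_dump.bin file is generated in the directory specified by path.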

DebugConfig.txt

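dumpTensor            # Master switch for dumping the data of every layer of the network model; must be enabled. (Disabled by default.)
eliminateGarbage      # Remove useless data when dumping network model data; must be enabled.
dequantFixed          # For fixed-point network models, convert integer data to floating-point data; must be enabled. (Disabled by default.)
#dumpasstring         # Dump network model data as text; when disabled, the dump is a binary file. (Must stay disabled if the data is to be analyzed with the auto_dump_debug.sh script.)
#disableDomainFuseOps # Disable layer fusion when converting the fixed-point network model (keeping it disabled is recommended).
path=                 # Full output path for the generated files. (Fill in an absolute path such as **/home/user** after path=. If nothing follows path=, or there is no path= entry at all, output goes to **$HOME** by default. The path must not exceed 122 bytes.)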

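Notes

After the dump completes, the result is saved in sigma_outtensor_dump.bin. If you rename the file, do not change the .bin suffix.
A new dump overwrites sigma_outtensor_dump.bin, so save the file before running again.
About the disableDomainFuseOps option in DebugConfig.txt:
- It takes effect while converting the fixed-point network model; keeping it disabled is recommended.
- It is disabled by default, meaning the calibrator fuses layers when converting the float network model to a fixed-point network model. Enabling the option turns layer fusion off.
- With the option disabled, the operators of the fixed-point and offline network models are optimized during conversion. This speeds up the model but also changes the network's layer structure, so the outputs of some operators are not dumped to sigma_outtensor_dump.bin.
- If you need the data of every layer, enable disableDomainFuseOps and rerun the calibrator to convert the fixed-point network model. The resulting model has no fusion optimization and can output the data of every layer.
Offline models do not support Dump Debug: their layer structure (Domain) is already fused, so per-layer data cannot be dumped.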
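
The constraints on the path= entry (absolute path, at most 122 bytes, fallback to $HOME when empty) can be checked before a simulation run. The following is a minimal sketch; the helper name check_dump_path is hypothetical and is not part of the toolchain.

```python
MAX_PATH_BYTES = 122  # limit stated in the DebugConfig.txt description


def check_dump_path(path):
    """Return a list of problems with a DebugConfig.txt `path=` value.

    Hypothetical helper: it only encodes the rules stated in the
    DebugConfig.txt description and does not touch the file system.
    """
    problems = []
    if not path:
        # An empty or missing path= makes the dump go to $HOME.
        problems.append("empty: dump will fall back to $HOME")
        return problems
    if not path.startswith("/"):
        problems.append("not an absolute path")
    if len(path.encode("utf-8")) > MAX_PATH_BYTES:
        problems.append("longer than 122 bytes")
    return problems
```

An empty result list means the value satisfies the stated rules; it does not guarantee that the directory exists or is writable.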

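2.3.1 Dumping Float.sim inference data

Simulate the float model with SGS_IPU_Toolchain/Scripts/calibrator/simulator.py. A sigma_outtensor_dump.bin file is generated in the directory specified by path, and the simulation results are saved in the ./log/output folder of the execution directory.

Example: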

python3 ~/SGS_IPU_Toolchain/Scripts/calibrator/simulator.py \
-m ./mobilenet_v1_float.sim \
-i ./000775.jpg \
-n ./pre.py \
--soc_version CHIP

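2.3.2 Dumping Fixed.sim inference data

Simulate the fixed-point model with SGS_IPU_Toolchain/Scripts/calibrator/simulator.py. A sigma_outtensor_dump.bin file is generated in the directory specified by path, and the simulation results are saved in the ./log/output folder of the execution directory.

Example: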

python3 ~/SGS_IPU_Toolchain/Scripts/calibrator/simulator.py \
-m ./mobilenet_v1_fixed.sim \
-i ./000775.jpg \
-n ./pre.py \
--soc_version CHIP

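Notes

The float-model and fixed-model simulations write a sigma_outtensor_dump.bin file and a log/output directory with the same names. Rename the generated files after every simulation so they are not overwritten.
When you are not dumping data, remove DebugConfig.txt from the execution directory promptly so that it does not affect model conversion.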
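
Renaming after each run is easy to forget, so it can be scripted. The sketch below is a hypothetical helper (stash_dump is not a toolchain function); it assumes the dump file is named sigma_outtensor_dump.bin and the log directory is named log, as described above, and appends a tag such as float or fixed.

```python
import os
import shutil


def stash_dump(run_dir, dump_dir, tag):
    """Rename one simulation's dump file and log directory with a tag.

    run_dir  -- directory where simulator.py was executed (contains ./log)
    dump_dir -- directory given by path= in DebugConfig.txt
    tag      -- label such as "float" or "fixed" (assumed naming scheme)
    """
    moved = []
    dump = os.path.join(dump_dir, "sigma_outtensor_dump.bin")
    if os.path.isfile(dump):
        # Keep the .bin suffix, as the dump tools require it.
        target = os.path.join(dump_dir, "sigma_outtensor_dump_%s.bin" % tag)
        if os.path.exists(target):
            raise FileExistsError(target)
        shutil.move(dump, target)
        moved.append(target)
    log = os.path.join(run_dir, "log")
    if os.path.isdir(log):
        target = os.path.join(run_dir, "log_%s" % tag)
        if os.path.exists(target):
            raise FileExistsError(target)
        shutil.move(log, target)
        moved.append(target)
    return moved
```

Calling stash_dump(".", "/home/user", "float") after the float run and again with "fixed" after the fixed run leaves two separately named dumps ready for comparison.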

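3 Analyzing data with the auto_dump_debug.sh script

3.1 Accuracy differences between the original framework model and the Float.sim model

If the inference result of the original framework model differs from that of the Float.sim model, use SGS_IPU_Toolchain/DumpDebug/auto_dump_debug.sh to compare the dumped bin files layer by layer. By comparing the COS, MSE and RMSE of the same output tensor in the sample bin and the benchmark bin, you can determine which operator produces the incorrect result that makes the model output wrong.

Run the SGS_IPU_Toolchain/DumpDebug/auto_dump_debug.sh script and look for the faulty layer.

Example: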

./auto_dump_debug.sh \
/home/user/SGS_IPU_Toolchain \
/path/to/float_sim_sigma_outtensor_dump.bin \
/path/to/caffe_NHWC_outtensor_dump.bin

① Comparing with the original Caffe model:

./auto_dump_debug.sh \
/home/user/SGS_IPU_Toolchain \
/home/user/sigma_outtensor_dump.bin \
/home/user/caffe_NHWC_outtensor_dump.bin

② Comparing with the original ONNX model:

./auto_dump_debug.sh \
/home/user/SGS_IPU_Toolchain \
/home/user/sigma_outtensor_dump.bin \
/home/user/onnx_NHWC_outtensor_dump.bin
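
For orientation, the three metrics can be sketched with their textbook definitions. This is only an illustration under that assumption: the exact formulas and normalization used by auto_dump_debug.sh are not given here and may differ, and the function name compare_tensors is hypothetical.

```python
import math


def compare_tensors(sample, benchmark):
    """Return (mse, cos, rmse) for two equal-length sequences of floats.

    Textbook definitions, used here as an assumption; the tool's own
    metrics may be normalized differently.
    """
    if not sample or len(sample) != len(benchmark):
        raise ValueError("tensors must be non-empty and of equal length")
    n = len(sample)
    # Mean squared error between corresponding elements.
    mse = sum((s - b) ** 2 for s, b in zip(sample, benchmark)) / n
    # Cosine similarity of the two flattened tensors.
    dot = sum(s * b for s, b in zip(sample, benchmark))
    norm_s = math.sqrt(sum(s * s for s in sample))
    norm_b = math.sqrt(sum(b * b for b in benchmark))
    cos = dot / (norm_s * norm_b) if norm_s and norm_b else 0.0
    rmse = math.sqrt(mse)
    return mse, cos, rmse
```

A COS close to 1 together with small MSE and RMSE indicates that the sample tensor tracks the benchmark tensor closely.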

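3.2 Accuracy differences between the Float.sim model and the Fixed.sim model

If the error between Float.sim and Fixed.sim is large, debug with the following flow: DebugFlow

Use SGS_IPU_Toolchain/DumpDebug/auto_dump_debug.sh to compare, layer by layer, the bin files dumped from Float.sim inference and Fixed.sim inference. By comparing the COS, MSE and RMSE of the same output tensor in the sample bin and the benchmark bin, you can determine which operator loses the most accuracy.

Example: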

./auto_dump_debug.sh \
/home/user/SGS_IPU_Toolchain \
/home/user/sample.bin \
/home/user/benchmark.bin

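Parameters:

Param1: Path to SGS_IPU_Toolchain. If it is in the current location, the folder name alone is enough.
Param2: Path to the already dumped sample bin to be compared; here it should be the bin file dumped from the fixed-point network model.
Param3: Path to the already dumped benchmark bin used as the reference; here it should be the bin file dumped from the float network model.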
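
Because the sample and benchmark arguments are positional, a small wrapper can make the order explicit. The sketch below only assembles the command line; build_compare_cmd is a hypothetical helper, and running the script from its own DumpDebug directory is an assumption based on the ./auto_dump_debug.sh examples.

```python
import os


def build_compare_cmd(toolchain_dir, sample_bin, benchmark_bin):
    """Assemble the auto_dump_debug.sh argument list (hypothetical helper).

    sample_bin    -- Param2, e.g. the dump of the Fixed.sim model
    benchmark_bin -- Param3, e.g. the dump of the Float.sim model
    Returns (cmd, workdir) for use with subprocess.run(cmd, cwd=workdir).
    """
    toolchain_dir = os.path.abspath(toolchain_dir)
    sample_bin = os.path.abspath(sample_bin)
    benchmark_bin = os.path.abspath(benchmark_bin)
    for path in (sample_bin, benchmark_bin):
        # Dump files must keep their .bin suffix.
        if not path.endswith(".bin"):
            raise ValueError("dump file must end with .bin: %s" % path)
    workdir = os.path.join(toolchain_dir, "DumpDebug")
    cmd = ["./auto_dump_debug.sh", toolchain_dir, sample_bin, benchmark_bin]
    return cmd, workdir
```

The result can be passed to subprocess.run(cmd, cwd=workdir, check=True). Absolute paths are used so that changing the working directory does not break the file arguments.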

Notes

The comparison result of the auto_dump_debug.sh script is affected by disableDomainFuseOps in DebugConfig.txt.

① With disableDomainFuseOps not enabled, the analysis output looks like this (partial):

   0) data                       MSE: 0.028521    COS: 0.990174    RMSE: 0.191048
   1) conv1/bn_xx_xx             MSE: 0.000060    COS: 0.999969    RMSE: 0.007786
   2) conv2_1/expand/bn_xx_xx    MSE: 0.000133    COS: 0.999922    RMSE: 0.010600
   3) conv2_1/dwise/bn_xx_xx     MSE: 0.001055    COS: 0.999604    RMSE: 0.026628
   4) conv2_1/linear/bn_xx       MSE: 0.008872    COS: 0.997960    RMSE: 0.077780
   5) conv2_2/expand/bn_xx_xx    MSE: 0.000849    COS: 0.999145    RMSE: 0.034236
   6) conv2_2/dwise/bn_xx_xx     MSE: 0.002526    COS: 0.998491    RMSE: 0.054812
   7) conv2_2/linear/bn_xx       MSE: 0.016800    COS: 0.995695    RMSE: 0.100490
   8) conv3_1/expand/bn_xx_xx    MSE: 0.000814    COS: 0.997401    RMSE: 0.067148
   9) conv3_1/dwise/bn_xx_xx     MSE: 0.003532    COS: 0.993593    RMSE: 0.096485
  10) block_3_1                  MSE: 0.054808    COS: 0.992504    RMSE: 0.127393
  11) conv3_2/expand/bn_xx_xx    MSE: 0.002223    COS: 0.995493    RMSE: 0.076465
  12) conv3_2/dwise/bn_xx_xx     MSE: 0.003252    COS: 0.997917    RMSE: 0.058834
  13) conv3_2/linear/bn_xx       MSE: 0.030526    COS: 0.995008    RMSE: 0.099188
  14) conv4_1/expand/bn_xx_xx    MSE: 0.000484    COS: 0.998078    RMSE: 0.053329
  15) conv4_1/dwise/bn_xx_xx     MSE: 0.001138    COS: 0.995249    RMSE: 0.085869
  16) block_4_1                  MSE: 0.041154    COS: 0.993904    RMSE: 0.108148
  17) conv4_2/expand/bn_xx_xx    MSE: 0.000628    COS: 0.997298    RMSE: 0.064804
  18) conv4_2/dwise/bn_xx_xx     MSE: 0.001379    COS: 0.994016    RMSE: 0.094983
  19) block_4_2                  MSE: 0.047413    COS: 0.992955    RMSE: 0.115523
  20) conv4_3/expand/bn_xx_xx    MSE: 0.001276    COS: 0.997816    RMSE: 0.054641
  21) conv4_3/dwise/bn_xx_xx     MSE: 0.004304    COS: 0.996268    RMSE: 0.078860
  22) conv4_3/linear/bn_xx       MSE: 0.019364    COS: 0.992479    RMSE: 0.122014
  23) conv4_4/expand/bn_xx_xx    MSE: 0.000678    COS: 0.996980    RMSE: 0.070730
  24) conv4_4/dwise/bn_xx_xx     MSE: 0.001787    COS: 0.993290    RMSE: 0.109461
  25) block_4_4                  MSE: 0.031072    COS: 0.992266    RMSE: 0.123066
  26) conv4_5/expand/bn_xx_xx    MSE: 0.000657    COS: 0.995815    RMSE: 0.085172
  27) conv4_5/dwise/bn_xx_xx     MSE: 0.001807    COS: 0.993433    RMSE: 0.106439
  28) block_4_5                  MSE: 0.043631    COS: 0.991963    RMSE: 0.125202
  29) conv4_6/expand/bn_xx_xx    MSE: 0.000639    COS: 0.995696    RMSE: 0.087078
  30) conv4_6/dwise/bn_xx_xx     MSE: 0.001854    COS: 0.994223    RMSE: 0.098243
  31) block_4_6                  MSE: 0.061734    COS: 0.991830    RMSE: 0.125977
  32) conv4_7/expand/bn_xx_xx    MSE: 0.001029    COS: 0.994616    RMSE: 0.095454
  33) conv4_7/dwise/bn_xx_xx     MSE: 0.003071    COS: 0.996565    RMSE: 0.072931
  34) conv4_7/linear/bn_xx       MSE: 0.014118    COS: 0.992815    RMSE: 0.117115
  35) conv5_1/expand/bn_xx_xx    MSE: 0.000649    COS: 0.997619    RMSE: 0.061338
  36) conv5_1/dwise/bn_xx_xx     MSE: 0.001037    COS: 0.995496    RMSE: 0.087281
  37) block_5_1                  MSE: 0.022231    COS: 0.993252    RMSE: 0.114678
  38) conv5_2/expand/bn_xx_xx    MSE: 0.000522    COS: 0.996605    RMSE: 0.078887
  39) conv5_2/dwise/bn_xx_xx     MSE: 0.001216    COS: 0.995946    RMSE: 0.079409
  40) block_5_2                  MSE: 0.031804    COS: 0.993276    RMSE: 0.115544
  41) conv5_3/expand/bn_xx_xx    MSE: 0.000592    COS: 0.995792    RMSE: 0.085456
  42) conv5_3/dwise/bn_xx_xx     MSE: 0.001892    COS: 0.997871    RMSE: 0.053695
  43) conv5_3/linear/bn_xx       MSE: 0.008352    COS: 0.994394    RMSE: 0.104836
  44) conv6_1/expand/bn_xx_xx    MSE: 0.000359    COS: 0.997728    RMSE: 0.062928
  45) conv6_1/dwise/bn_xx_xx     MSE: 0.000522    COS: 0.996774    RMSE: 0.072542
  46) block_6_1                  MSE: 0.012969    COS: 0.995122    RMSE: 0.098212
  47) conv6_2/expand/bn_xx_xx    MSE: 0.000311    COS: 0.997588    RMSE: 0.068856
  48) conv6_2/dwise/bn_xx_xx     MSE: 0.000694    COS: 0.996954    RMSE: 0.066177
  49) block_6_2                  MSE: 0.018713    COS: 0.994592    RMSE: 0.103677
  50) conv6_3/expand/bn_xx_xx    MSE: 0.000360    COS: 0.995374    RMSE: 0.095421
  51) conv6_3/dwise/bn_xx_xx     MSE: 0.001144    COS: 0.998021    RMSE: 0.047618
  52) conv6_3/linear/bn_xx       MSE: 0.005441    COS: 0.991614    RMSE: 0.128396
  53) conv6_4/bn_xx_xx           MSE: 0.022556    COS: 0.991828    RMSE: 0.135669
  54) pool6                      MSE: 0.004184    COS: 0.995849    RMSE: 0.094941
  55) fc7                        MSE: 0.081746    COS: 0.995249    RMSE: 0.104243
  56) prob                       MSE: 0.000000    COS: 1.000000    RMSE: 0.000914

② With disableDomainFuseOps enabled, the analysis output looks like this (partial):
   0) data                       MSE: 0.028521    COS: 0.990174    RMSE: 0.191048
   1) conv1/bn_xx_xx             MSE: 0.000060    COS: 0.999969    RMSE: 0.007786
   2) conv2_1/expand/bn_xx_xx    MSE: 0.000133    COS: 0.999922    RMSE: 0.010600
   3) conv2_1/dwise/bn_xx_xx     MSE: 0.001055    COS: 0.999604    RMSE: 0.026628
   4) conv2_1/linear/bn_xx       MSE: 0.008872    COS: 0.997960    RMSE: 0.077780
   5) conv2_2/expand/bn_xx_xx    MSE: 0.000849    COS: 0.999145    RMSE: 0.034236
   6) conv2_2/dwise/bn_xx_xx     MSE: 0.002526    COS: 0.998491    RMSE: 0.054812
   7) conv2_2/linear/bn_xx       MSE: 0.016800    COS: 0.995695    RMSE: 0.100490
   8) conv3_1/expand/bn_xx_xx    MSE: 0.000814    COS: 0.997401    RMSE: 0.067148
   9) conv3_1/dwise/bn_xx_xx     MSE: 0.003532    COS: 0.993593    RMSE: 0.096485
  10) conv3_1/linear/bn_xx       MSE: 0.026856    COS: 0.987982    RMSE: 0.156977
  11) block_3_1                  MSE: 0.054808    COS: 0.992504    RMSE: 0.127393
  12) conv3_2/expand/bn_xx_xx    MSE: 0.002223    COS: 0.995493    RMSE: 0.076465
  13) conv3_2/dwise/bn_xx_xx     MSE: 0.003252    COS: 0.997917    RMSE: 0.058834
  14) conv3_2/linear/bn_xx       MSE: 0.030526    COS: 0.995008    RMSE: 0.099188
  15) conv4_1/expand/bn_xx_xx    MSE: 0.000484    COS: 0.998078    RMSE: 0.053329
  16) conv4_1/dwise/bn_xx_xx     MSE: 0.001138    COS: 0.995249    RMSE: 0.085869
  17) conv4_1/linear/bn_xx       MSE: 0.008352    COS: 0.991382    RMSE: 0.126459
  18) block_4_1                  MSE: 0.041154    COS: 0.993904    RMSE: 0.108148
  19) conv4_2/expand/bn_xx_xx    MSE: 0.000628    COS: 0.997298    RMSE: 0.064804
  20) conv4_2/dwise/bn_xx_xx     MSE: 0.001379    COS: 0.994016    RMSE: 0.094983
  21) conv4_2/linear/bn_xx       MSE: 0.005679    COS: 0.990344    RMSE: 0.133938
  22) block_4_2                  MSE: 0.047413    COS: 0.992955    RMSE: 0.115523
  23) conv4_3/expand/bn_xx_xx    MSE: 0.001276    COS: 0.997816    RMSE: 0.054641
  24) conv4_3/dwise/bn_xx_xx     MSE: 0.004304    COS: 0.996268    RMSE: 0.078860
  25) conv4_3/linear/bn_xx       MSE: 0.019364    COS: 0.992479    RMSE: 0.122014
  26) conv4_4/expand/bn_xx_xx    MSE: 0.000678    COS: 0.996980    RMSE: 0.070730
  27) conv4_4/dwise/bn_xx_xx     MSE: 0.001787    COS: 0.993290    RMSE: 0.109461
  28) conv4_4/linear/bn_xx       MSE: 0.009286    COS: 0.991433    RMSE: 0.132465
  29) block_4_4                  MSE: 0.031072    COS: 0.992266    RMSE: 0.123066
  30) conv4_5/expand/bn_xx_xx    MSE: 0.000657    COS: 0.995815    RMSE: 0.085172
  31) conv4_5/dwise/bn_xx_xx     MSE: 0.001807    COS: 0.993433    RMSE: 0.106439
  32) conv4_5/linear/bn_xx       MSE: 0.011957    COS: 0.989749    RMSE: 0.144398
  33) block_4_5                  MSE: 0.043631    COS: 0.991963    RMSE: 0.125202
  34) conv4_6/expand/bn_xx_xx    MSE: 0.000639    COS: 0.995696    RMSE: 0.087078
  35) conv4_6/dwise/bn_xx_xx     MSE: 0.001854    COS: 0.994223    RMSE: 0.098243
  36) conv4_6/linear/bn_xx       MSE: 0.018842    COS: 0.989861    RMSE: 0.140551
  37) block_4_6                  MSE: 0.061734    COS: 0.991830    RMSE: 0.125977
  38) conv4_7/expand/bn_xx_xx    MSE: 0.001029    COS: 0.994616    RMSE: 0.095454
  39) conv4_7/dwise/bn_xx_xx     MSE: 0.003071    COS: 0.996565    RMSE: 0.072931
  40) conv4_7/linear/bn_xx       MSE: 0.014118    COS: 0.992815    RMSE: 0.117115
  41) conv5_1/expand/bn_xx_xx    MSE: 0.000649    COS: 0.997619    RMSE: 0.061338
  42) conv5_1/dwise/bn_xx_xx     MSE: 0.001037    COS: 0.995496    RMSE: 0.087281
  43) conv5_1/linear/bn_xx       MSE: 0.008186    COS: 0.992600    RMSE: 0.123085
  44) block_5_1                  MSE: 0.022231    COS: 0.993252    RMSE: 0.114678
  45) conv5_2/expand/bn_xx_xx    MSE: 0.000522    COS: 0.996605    RMSE: 0.078887
  46) conv5_2/dwise/bn_xx_xx     MSE: 0.001216    COS: 0.995946    RMSE: 0.079409
  47) conv5_2/linear/bn_xx       MSE: 0.010952    COS: 0.992991    RMSE: 0.118647
  48) block_5_2                  MSE: 0.031804    COS: 0.993276    RMSE: 0.115544
  49) conv5_3/expand/bn_xx_xx    MSE: 0.000592    COS: 0.995792    RMSE: 0.085456
  50) conv5_3/dwise/bn_xx_xx     MSE: 0.001892    COS: 0.997871    RMSE: 0.053695
  51) conv5_3/linear/bn_xx       MSE: 0.008352    COS: 0.994394    RMSE: 0.104836
  52) conv6_1/expand/bn_xx_xx    MSE: 0.000359    COS: 0.997728    RMSE: 0.062928
  53) conv6_1/dwise/bn_xx_xx     MSE: 0.000522    COS: 0.996774    RMSE: 0.072542
  54) conv6_1/linear/bn_xx       MSE: 0.004180    COS: 0.995212    RMSE: 0.099410
  55) block_6_1                  MSE: 0.012969    COS: 0.995122    RMSE: 0.098212
  56) conv6_2/expand/bn_xx_xx    MSE: 0.000311    COS: 0.997588    RMSE: 0.068856
  57) conv6_2/dwise/bn_xx_xx     MSE: 0.000694    COS: 0.996954    RMSE: 0.066177
  58) conv6_2/linear/bn_xx       MSE: 0.006447    COS: 0.995696    RMSE: 0.095621
  59) block_6_2                  MSE: 0.018713    COS: 0.994592    RMSE: 0.103677
  60) conv6_3/expand/bn_xx_xx    MSE: 0.000360    COS: 0.995374    RMSE: 0.095421
  61) conv6_3/dwise/bn_xx_xx     MSE: 0.001144    COS: 0.998021    RMSE: 0.047618
  62) conv6_3/linear/bn_xx       MSE: 0.005441    COS: 0.991614    RMSE: 0.128396
  63) conv6_4/bn_xx_xx           MSE: 0.022556    COS: 0.991828    RMSE: 0.135669
  64) pool6                      MSE: 0.004184    COS: 0.995849    RMSE: 0.094941
  65) fc7                        MSE: 0.081746    COS: 0.995249    RMSE: 0.104243
  66) prob                       MSE: 0.000000    COS: 1.000000    RMSE: 0.000914


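When a layer's metrics look poor and you want to see the actual data, look in the run directory for the Undefined_xxx.bin_DumpDebug_out folder (xxx.bin is the name of the bin file).

Inside it are subfolders numbered from 0; the numbers match the index printed before each tensor name in the metric output above.

The data is written 16 values per row. If you need one value per row, set this extra environment variable before running ./auto_dump_debug.sh: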

export SAVE_ONE_ROW=1


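3.2.1 Check the preprocessing
1) Models with image input
The image_preprocess function in the preprocessing Python file must take two parameters:

The image path
The normalization flag (norm=True)

The normalization flag tells the function whether the image must be normalized for this run. Running the Float.sim model requires a normalized image, so image_preprocess is called with norm=True and the normalization must be performed in the norm-is-True branch. Running the Fixed.sim and Offline models requires a UINT8 image of the model's input size, so no normalization is done. Before feeding images to the model, check that the normalization flag is used correctly and that each case is handled properly. In addition, the mean and std values configured in input_config.ini and in the preprocessing Python file must be identical.
2) Models with non-image input
See the documentation on writing preprocessing files, and make sure the preprocessing Python file returns a numpy.ndarray of the appropriate type whose size matches the model input.
3.2.2 Check the result with ALL_INT16 quantization
Add the following to input_config.ini: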
[CONV_CONFIG]
input_format=ALL_INT16;

The Float.sim model must be regenerated from the original model for this setting to take effect.
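
The edit can also be scripted with Python's standard configparser module. This is a sketch under assumptions: the helper name enable_all_int16 is hypothetical, the file is assumed to be plain ini that configparser can parse, and rewriting the file this way drops any comment lines it contained, so keep a backup.

```python
import configparser


def enable_all_int16(ini_path):
    """Add [CONV_CONFIG] input_format=ALL_INT16; to an ini file.

    Hypothetical helper. Comments in the original file are not
    preserved by configparser when the file is written back.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read(ini_path, encoding="utf-8")
    if not parser.has_section("CONV_CONFIG"):
        parser.add_section("CONV_CONFIG")
    # The trailing semicolon follows the snippet shown above.
    parser.set("CONV_CONFIG", "input_format", "ALL_INT16;")
    with open(ini_path, "w", encoding="utf-8") as handle:
        # Write key=value without spaces around "=", as in the snippet.
        parser.write(handle, space_around_delimiters=False)
```

After running it, rerun the conversion from the original model so that the regenerated Float.sim picks up the new setting.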
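3.2.3 Analyze with the DumpDebug tool
Compare the per-layer accuracy of Float.sim and Fixed.sim; the MSE, COS and RMSE of each layer's tensor are printed for reference.
Because the Fixed.sim model has fused operators, some tensors cannot be dumped. Enable the disableDomainFuseOps option in DebugConfig.txt, regenerate the Fixed.sim model, dump the data again, and compare with dump_debug.py to obtain more detailed comparison information.
3.2.4 Improve accuracy with the training quantization tool

SGS_IPU_Toolchain/Scripts/calibrator/torch_calibrator.py
This script offers two quantization levels. The Q2 level is recommended for its better accuracy, but it needs a GPU environment to be configured; otherwise training may take a long time. When training finishes, the best quantization parameters are selected automatically to generate the Fixed.sim model.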

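3.2.5 Can the layer that loses accuracy be clearly identified?
Among the MSE, COS and RMSE values produced by the DumpDebug tool, past experience suggests that COS < 0.99 or RMSE > 0.1 indicates that the layer's accuracy may no longer be adequate.

As shown in the figure above, if the three metrics keep oscillating in the DumpDebug output, for example COS stays between 0.98 and 0.99 and RMSE stays between 0.01 and 0.1, apply the training quantization tool and compare again.
If MSE, COS and RMSE all change sharply starting from a particular layer or group of layers, the problem starts at that layer. If the metrics do not change abruptly, for example COS drifts down slowly from 0.99 and RMSE climbs slowly from 0.1, take the first layer at which either of these two metrics crosses its empirical threshold as the main target for further investigation.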
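
Finding the first layer that crosses the empirical thresholds can be automated on the printed report. The sketch below assumes the line format shown in the sample output (index, tensor name, then MSE, COS and RMSE); the function name first_suspect_layer is hypothetical, and the default thresholds are the empirical values quoted above.

```python
import re

# Matches lines such as:
#    7) conv2_2/linear/bn_xx   MSE: 0.016800    COS: 0.995695    RMSE: 0.100490
LINE_RE = re.compile(
    r"^\s*(\d+)\)\s+(\S+)\s+MSE:\s*([\d.]+)\s+COS:\s*([\d.]+)\s+RMSE:\s*([\d.]+)"
)


def first_suspect_layer(report, cos_min=0.99, rmse_max=0.1):
    """Return (index, name, cos, rmse) of the first layer past a threshold.

    Returns None when every parsed layer is within both thresholds.
    """
    for line in report.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip headers and unrelated output
        index, name = int(match.group(1)), match.group(2)
        cos, rmse = float(match.group(4)), float(match.group(5))
        if cos < cos_min or rmse > rmse_max:
            return index, name, cos, rmse
    return None
```

The first hit may be the input tensor itself (for example data), in which case the input quantization and preprocessing are worth checking before any later layer.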
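3.2.6 Can it be solved by modifying parameters manually?
The main purpose of modifying parameters manually is to quantize the poorly performing tensors with INT16 and see whether accuracy rises to an acceptable level.
Once a suspect tensor is found, first open both the Float.sim and Fixed.sim models with the Netron tool provided in SGS_IPU_Toolchain and locate the problematic operator.

For the Fixed.sim model, click the operator and expand its input and output tensors in the panel on the right to see the quantization information. As shown in the figure above, if a tensor's min/max range exceeds 20 and the tensor is still UINT8-quantized, the convolution can be upgraded to INT16 quantization by editing input_config.ini.
Because input_config.ini can only set the input tensor of a convolution to INT16 quantization, if the operator is not a convolution, configure the few convolutions before and after it as INT16 and check whether accuracy improves.
3.2.7 Provide data for analysis
Providing the original model, or the Float.sim and Fixed.sim models, is the fastest way to reproduce the problem.
If the models cannot be provided, provide the dumped Float.sim and Fixed.sim data, and use the following script to dump the quantization information of the Fixed.sim model: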
python3 SGS_IPU_Toolchain/Scripts/examples/save_quant_param.py \
    -m mobilenet_v1_fixed.sim

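This generates mobilenet_v1_fixed.sim.json in the directory containing mobilenet_v1_fixed.sim.

Notes

Provide the per-layer accuracy comparison results.
If the operator that loses accuracy can be identified, provide its float input and output data and its fixed input and output data.
If the operator that loses accuracy cannot be identified, give the comparison data from the first step to the FAE team so they can determine which data needs to be provided.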



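4. Accuracy problems on the board
When the board result differs from the PC offline simulation result, first check whether the PC fixed-model simulation result matches the offline simulation result. If they match, use the methods below to find out why the board result differs from the PC offline simulation result.

4.1 Use simulator to verify that the Fixed and Offline models give the same result
On the PC, run the Fixed model and the Offline model separately with SGS_IPU_Toolchain/Scripts/calibrator/simulator.py and compare their outputs.
Running the Fixed model: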
python3 SGS_IPU_Toolchain/Scripts/calibrator/simulator.py \
-i 000775.jpg \
-m mobilenet_v1_fixed.sim \
-n mobilenet_v1_preprocess.py \
--soc_version CHIP


Notes

The results are saved in the ./log/output/ directory. Rename the log directory (for example to log_fixed) so it is not overwritten.


Running the Offline model:
python3 SGS_IPU_Toolchain/Scripts/calibrator/simulator.py \
-i 000775.jpg \
-m mobilenet_v1_fixed.sim_sgsimg.img \
-n mobilenet_v1_preprocess.py \
--soc_version CHIP


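Notes

The results are saved in the ./log/output/ directory. Rename the log directory (for example to log_offline) so it is not overwritten.


If the Fixed model result matches the Offline model result, go on to find out why the board result differs from the PC offline simulation result.
4.2 Use simulator to verify that the Offline model gives the same result on the board and on the PC
On the PC, SGS_IPU_Toolchain/Scripts/calibrator/simulator.py can run the offline model on the board through RPC and write the board's inference result back to the output directory on the PC. This makes it easy to compare the board inference result with the PC inference result, rules out differences in input data caused by preprocessing and similar factors, and helps locate the problem quickly.
The Linux SDK (alkaid) provides the ipu_server app at sdk/verify/release_feature/source/dla/ipu_server.
Run the prog_dla_ipu_server application on the board to start the RPC service. SGS_IPU_Toolchain/Scripts/calibrator/simulator.py on the PC connects to the board application by IP address and port, sends the model from the PC to the board to run inference, and returns the inference result to the PC.
Starting the RPC service on the board:
./prog_dla_ipu_server -p PORT

Running inference on the board from the PC through the RPC service: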
python3 SGS_IPU_Toolchain/Scripts/calibrator/simulator.py \
-i 000775.jpg \
-m mobilenet_v1_fixed.sim_sgsimg.img \
-n mobilenet_v1_preprocess.py \
--host BOARD_IP \
--port PORT \
--soc_version CHIP


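Notes
Related parameters:

--host HOST           IPU Server host.
--port PORT           IPU Server port.
--timeout TIMEOUT     Set timeout seconds, default is 60s.
--model_onboard_path  MODEL_ONBOARD_PATH Model on board path.

When the run finishes, compare the PC simulation result with the board result obtained through the RPC service. If they differ, contact the responsible FAE for support. If they match, examine the board-side demo to find the bug that makes the board inference result incorrect.
5. Common model conversion errors
This section surveys typical errors that can occur throughout model conversion, from parsing the original model format and converting the intermediate representation to adapting to the target platform. The errors are grouped by stage and traced to their root causes to help developers locate problems quickly and work out a fix.
5.1 Root causes during float network model conversion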



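Float model conversion	Error message	Cause
	“`input_formats` is BGR only support `training_input_formats` is BGR!”	When running ConvertTool: training_input_formats and input_formats in the ini configuration file do not match
	“KeyError: 'output1'”	When running ConvertTool: inputs/outputs in the ini configuration file differ from those of the model being converted
	“configparser.MissingSectionHeaderError: File contains no section headers.”	When running ConvertTool: the ini configuration file contains a syntax error
	“Loopback in graph is not support!”	When running ConvertTool: the model file contains a loop in its graph
	“Nodes in a graph must be topologically sorted, however input 'Transpose_7_o0_rewrite' of node:name: Conv13 OpType: Conv is not output of any previous nodes.”	When running ConvertTool: operators in the original model file are in the wrong execution order
	“Please export the ONNX with opset_version <= 20”	When running ConvertTool: the ONNX opset version is too high and exceeds the currently supported version
	“ONNX PAD axis should not be given as input”	The axis parameter of the ONNX PAD operator does not support tensor form
	“ONNX MVN input is expected to have four dimensions”	The ONNX MVN operator only supports 4-D input
	“ONNX EXPAND shape input only support const tensor”	The ONNX EXPAND operator only supports const shape information
	“RNN only support sequence_lens empty tensor”	Parameter support limitation of the ONNX RNN operator
	“ONNX RNN not support clip”	Parameter support limitation of the ONNX RNN operator
	“Do not support activation_beta”	Parameter support limitation of the ONNX RNN operator
	“activation_alpha only support 0.01”	Parameter support limitation of the ONNX RNN operator
	“ONNX RNN  not  support direction reverse”	Parameter support limitation of the ONNX RNN operator
	“ONNX RNN activation only support [Sigmoid tanh tanh]”	Parameter support limitation of the ONNX RNN operator
	“ONNX Pad only support constant and reflect mode”	Parameter support limitation of the ONNX PAD operator
	“ONNX MOD only support fmod 0”	Parameter support limitation of the ONNX MOD operator
	“LSTM unsupport sequence_lens variable tensor”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LSTM not support clip”	Parameter support limitation of the ONNX LSTM operator
	“Do not support activation_beta”	Parameter support limitation of the ONNX LSTM operator
	“activation_alpha only support 0.01”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LSTM layout only support 0”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LSTM input_forget only support 0”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LSTM  not  support direction reverse”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LSTM activation only support [Sigmoid tanh tanh]”	Parameter support limitation of the ONNX LSTM operator
	“ONNX LayerNormalization not support stash_type is 0”	Parameter support limitation of the ONNX LayerNormalization operator
	“ONNX LayerNormalization not support axis is 0”	Parameter support limitation of the ONNX LayerNormalization operator
	“GRU unsupport sequence_lens variable tensor”	Parameter support limitation of the ONNX GRU operator
	“ONNX GRU layout only support 0”	Parameter support limitation of the ONNX GRU operator
	“ONNX GRU only support forward and bidirectional”	Parameter support limitation of the ONNX GRU operator
	“ONNX GRU not support clip”	Parameter support limitation of the ONNX GRU operator
	“gru activation only support Sigmoid and tanh”	Parameter support limitation of the ONNX GRU operator
	“ONNX GRU not support activation_alpha”	Parameter support limitation of the ONNX GRU operator
	“Not support yet!”	The operator is not supported
	“Exception: ONNX Upsample only support height and width resize:”	An operator specification in the model exceeds the capability of the target chip; see the error message for details
	“Exception:  ONNX TOPK SGS only support the top-K sorted elements along a specified axis:/TopK”	An operator specification in the model exceeds the capability of the target chip; see the error message for details
	“Exception:  ONNX TOPK SGS only support the top-K largest elements along a specified axis:/TopK”	An operator specification in the model exceeds the capability of the target chip; see the error message for details
	“Exception: ONNX Sum only support 2 inputs:”	An operator specification in the model exceeds the capability of the target chip; see the error message for details
	“ValueError: Split_V outputs number exceeds the limit.”	An operator specification in the model exceeds the capability of the target chip; see the error message for details
	“sub0 has wrong dynamic shape that cannot be figured out! please check if source shape is graph input”	An operator attribute error in the model file makes it impossible to compute the output tensor
	“Exception: ONNX ScatterND reduction attrs didn't support 1 attribute yet!”	Parameter support limitation of the ONNX ScatterND operator
	“Exception: ONNX ScatterElements include unsopport reduction type:/ScatterElements”	Parameter support limitation of the ONNX ScatterElements operator
	“Exception: ONNX RESIZE only support 'nearest_mode' : round_prefer_floor/floor/round_prefer_ceil:”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: ONNX RESIZE only support nearest or linear:”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: resize do not support keep_aspect_ratio_policy”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: ONNX RESIZE unsupport 'exclude_outside' non-zero value:”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: ONNX RESIZE only support 'coordinate_transformation_mode' : align_corners/asymmetric/half_pixel/pytorch_half_pixel:”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: resize antialias only support 0”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: ONNX RESIZE scale tensor only support height and width resize:”	Parameter support limitation of the ONNX RESIZE operator
	“Exception: ONNX RESHAPE unsupport allowedzero set 1:”	Parameter support limitation of the ONNX RESHAPE operator
	“DYNSHAPESYM sub lack shape info, please check DYNAMIC_CONFIG in ini!”	An operator attribute error in the model file makes it impossible to compute the output tensor
	“Exception: do not support noop_with_empty_axes not 0:”	Parameter support limitation of the ONNX ReduceSum/ReduceMin/ReduceMax/ReduceMean/ReduceL2 operators
	“Exception: not support onnx op type OneHot yet”	The model file contains an unsupported operator
	“Exception: ONNX DepthToSpace c must be divisible by blocksize^2”	Parameter support limitation of the ONNX DepthToSpace operator
	“Exception: ONNX CUMSUM reverse only support is 0:”	Parameter support limitation of the ONNX CUMSUM operator
	“Exception: ONNX CUMSUM exclusive only support is 0:”	Parameter support limitation of the ONNX CUMSUM operator
	“Exception: ONNX ConvTranspose3d filter node /convtranspose3d/ConvTranspose not support kernel size 3  dilation 1 padding 4 4”	Parameter support limitation of the ONNX ConvTranspose operator
	“Exception: ONNX ConvTranspose not support attr output_shape:”	Parameter support limitation of the ONNX ConvTranspose operator
	“ValueError: Conv3d strideW strideH can not be larger than 31!!!”	A parameter of the ONNX ConvTranspose operator exceeds the capability of the target chip
	“Exception: Not support group conv3d yet:/conv3d/Conv”	Parameter support limitation of the ONNX Conv operator
	“Assertion `s32InputCount <= ((10000))' failed.”	A parameter of the ONNX CONCAT operator exceeds the capability of the target chip
	“Exception: ONNX BatchNormalization not support training_mode not 0:”	Parameter support limitation of the ONNX BatchNormalization operator
	“BatchNormalization not support spatial not 1:”	Parameter support limitation of the ONNX BatchNormalization operator
	“BatchNormalization not support is_test not 0:”	Parameter support limitation of the ONNX BatchNormalization operator
	“Exception: ONNX AveragePool  node  not support kernel size 180 180 stride 300 300”	A parameter of the ONNX AveragePool operator exceeds the capability of the target chip
	“Exception: ONNX AveragePool no support padding over 255”	A parameter of the ONNX AveragePool operator exceeds the capability of the target chip
	“Exception: ONNX AveragePool  node  not support kernel size 300 300 stride 1 1”	A parameter of the ONNX AveragePool operator exceeds the capability of the target chip
	“Exception: ONNX AVGPOOL only support dilation 1:”	A parameter of the ONNX AveragePool operator exceeds the capability of the target chip
	“Exception: ONNX ARGMAX select_last_index only support 0:”	Parameter support limitation of the ONNX ArgMax operator
	“Exception: ONNX ARGMIN select_last_index only support 0:”	Parameter support limitation of the ONNX ArgMin operator
	“ValueError: Not support soc_version: xxx”	xxx is not in the list of supported chips
	“Input graph file {} does not exist!”	The model file does not exist
	“google.protobuf.message.DecodeError: Protobuf decoding consumed too few bytes: 1 out of 736”	When running ConvertTool: the original model file does not match the configured model type
	“`input_formats` is RGB only support `training_input_formats` is RGB!”	training_input_formats and input_formats in the ini configuration file do not match
	“RuntimeError: The model doesn't have input named \"images1\"”	inputs/outputs in the ini configuration file differ from those of the model being converted
	“configparser.MissingSectionHeaderError: File contains no section headers.”	The ini configuration file contains a syntax error
	“sub 0, tensor [output] has dynamic shape that cannot be figured out!”	An unknown input prevents shape inference from completing
	“Assert!!! Errors happened in At most one dimension of the new shape can be -1”	The reshape operator has more than one dimension marked -1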



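5.2 Root causes during fixed-point network model conversion

Fixed-point model conversion	Error message	Cause
	“FileNotFoundError: No images found in”	When running calibrator: the file format listed in the txt does not match the training set, a file listed in the txt does not exist, or the txt file format is wrong
	“void SGS_CheckPoolFilterSize(OperationType, SGS_S32, SGS_S32, SGS_S32, SGS_S32): Assertion `0' failed.”	The kernel parameter of MAXPOOL exceeds the capability of the target chip
	“ERROR: no schedule info. weight name:/Constant_output_0”	The Conv parameters exceed the hardware spec
	“ValueError: Not support soc_version: xxx”	The chip model xxx given as soc_version is not in the list of supported chips
	“ValueError: Not recognized model.”	When running calibrator: the model file format is wrong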



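5.3 Root causes during offline network model conversion

Offline model conversion	Error message	Cause
	“FileNotFoundError: No such model:”	When running Compiler: the model file does not exist
	“ValueError: Not support soc_version: xxx”	When running Compiler: soc_version xxx is not in the list of supported chips
	“ValueError: Not recognized model”	When running Compiler: the model format is wrong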



5.4 Root causes during model accuracy debugging

Accuracy debugging	Error message	Cause
	“Assert!!! Errors happened in Image Header output 0 ExtFlag check failed”	The offline file has an output ExtFlag error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 ExtFlag check failed”	The offline file has an input ExtFlag error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Height alignment check failed”	The offline file has an input Height alignment error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Width alignment check failed”	The offline file has an input Width alignment error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 Aligned Buffer Size check failed”	The offline file has an output Aligned Buffer Size error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Aligned Buffer Size check failed”	The offline file has an input Aligned Buffer Size error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 Scalar check failed”	The offline file has an output Scalar error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Scalar check failed”	The offline file has an input Scalar error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 ZPoint check failed”	The offline file has an output ZPoint error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 ZPoint check failed”	The offline file has an input ZPoint error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 Shape check failed”	The offline file has an output Shape error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Shape check failed”	The offline file has an input Shape error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 Element format check failed”	The offline file has an output Element format error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 Element format check failed”	The offline file has an input Element format error; reconvert the model
	“Assert!!! Errors happened in Image Header output 0 dim check failed”	The offline file has an output tensor dimension error; reconvert the model
	“Assert!!! Errors happened in Image Header Input 0 dim check failed”	The offline file has an input dimension error; reconvert the model
	“Assert!!! Errors happened in Image Header Batch mode check failed”	The offline file has a Batch mode error; reconvert the model
	“FileNotFoundError:”	When running simulator: the path to the b.sim model is wrong
	“Not recognized model”	When running simulator: the format of the a.bin model file is wrong
	“ValueError: Got different num of preprocess_methods and images!”	When running simulator: the number of preprocessing scripts differs from the number of model inputs
	“ValueError: Cannot set input tensor: Got tensor of type Unknown TensorType but expected type FLOAT32 for input 0, name: images”	The preprocessing file returns data in the wrong format
	“ValueError: Cannot set tensor: Dimension mismatch. Got 160 but expected 640 for dimension 1 of input 0, name: images”	The preprocessing file contains a parameter error
	“Assert!!! Errors happened in Image Header BufferSize1 Check Failed!”	The offline file has a BufferSize1 error
	“Assert!!! Errors happened in Image Header BufferSize Check Failed!”	The offline file has a BufferSize error
	“TypeError: the 'package' argument is required to perform a relative import for”	When running simulator: the preprocessing file is of the wrong type
	“im, ratio, (dw, dh) = letterbox(im) SyntaxError: invalid syntax”	When running simulator: the preprocessing file contains a syntax error
	“struct.error: unpack requires a buffer of 3136 bytes”	When debugging with the Analysis Tool: the bin file is incomplete
	“ValueError: max() arg is an empty sequence”	The input file format for the auto_dump_debug.sh script is wrong
	“KeyError: '100'”	Raised when running the auto_dump_debug.sh script; 100 is an incorrect tensor name in sigma_outtensor_dump_fixed.bin and sigma_outtensor_dump_float.bin
	“UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte”	When running simulator: the txt file format is wrong
	“ValueError: Not support soc_version: xxx”	When running simulator: soc_version xxx is not in the list of supported chips
	“ValueError: Cannot set tensor: Dimension mismatch. Got 320 but expected 640 for dimension 1 of input 0, name: images”	When running simulator: the preprocessing file contains a parameter error
	“def image_preprocess(image_file, norm=True)SyntaxError: invalid syntax”	When running simulator: the preprocessing file contains a syntax error
	“Assert!!! Errors happened in Concat output shape is invalid!”	The offline file has an operator shape error
	“Assert!!! Errors happened in Image Header Batch mode check failed”	The offline file has a Batch mode error
	“RuntimeError: MI_IPU_CreateCHNWithUserMem failed: 54”	When running the RPC simulator: the current chip model does not match the soc_version used during model conversion

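Floating-point model conversion	Error message	Cause analysis
	“`input_formats` is BGR only support `training_input_formats` is BGR!”	When running ConvertTool, training_input_formats and input_formats in the ini configuration file do not match.
	“KeyError: 'output1'”	When running ConvertTool, the inputs/outputs in the ini configuration file do not match those of the model being converted.
	“configparser.MissingSectionHeaderError: File contains no section headers.”	When running ConvertTool, the ini configuration file contains a syntax error (content appears before any section header).
	“Loopback in graph is not support!”	When running ConvertTool, the model file contains a loop (cyclic graph).
	“Nodes in a graph must be topologically sorted, however input 'Transpose_7_o0_rewrite' of node:name: Conv13 OpType: Conv is not output of any previous nodes.”	When running ConvertTool, the operators in the original model file are not in a valid execution order.
	“Please export the ONNX with opset_version <= 20”	When running ConvertTool, the opset version of the ONNX model is too high and exceeds the currently supported range.
	“ONNX PAD axis should not be given as input”	The ONNX PAD operator does not support the axis parameter given as a tensor.
	“ONNX MVN input is expected to have four dimensions”	The ONNX MVN operator supports 4-dimensional input only.
	“ONNX EXPAND shape input only support const tensor”	The ONNX EXPAND operator supports shape information of const type only.
	“RNN only support sequence_lens empty tensor”	Parameter support limitation of the ONNX RNN operator.
	“ONNX RNN not support clip”	Parameter support limitation of the ONNX RNN operator.
	“Do not support activation_beta”	Parameter support limitation of the ONNX RNN operator.
	“activation_alpha only support 0.01”	Parameter support limitation of the ONNX RNN operator.
	“ONNX RNN not support direction reverse”	Parameter support limitation of the ONNX RNN operator.
	“ONNX RNN activation only support [Sigmoid tanh tanh]”	Parameter support limitation of the ONNX RNN operator.
	“ONNX Pad only support `constant` and `reflect` mode”	Parameter support limitation of the ONNX PAD operator.
	“ONNX MOD only support fmod 0”	Parameter support limitation of the ONNX MOD operator.
	“LSTM unsupport sequence_lens variable tensor”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LSTM not support clip”	Parameter support limitation of the ONNX LSTM operator.
	“Do not support activation_beta”	Parameter support limitation of the ONNX LSTM operator.
	“activation_alpha only support 0.01”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LSTM layout only support 0”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LSTM input_forget only support 0”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LSTM not support direction reverse”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LSTM activation only support [Sigmoid tanh tanh]”	Parameter support limitation of the ONNX LSTM operator.
	“ONNX LayerNormalization not support stash_type is 0”	Parameter support limitation of the ONNX LayerNormalization operator.
	“ONNX LayerNormalization not support axis is 0”	Parameter support limitation of the ONNX LayerNormalization operator.
	“GRU unsupport sequence_lens variable tensor”	Parameter support limitation of the ONNX GRU operator.
	“ONNX GRU layout only support 0”	Parameter support limitation of the ONNX GRU operator.
	“ONNX GRU only support forward and bidirectional”	Parameter support limitation of the ONNX GRU operator.
	“ONNX GRU not support clip”	Parameter support limitation of the ONNX GRU operator.
	“gru activation only support Sigmoid and tanh”	Parameter support limitation of the ONNX GRU operator.
	“ONNX GRU not support activation_alpha”	Parameter support limitation of the ONNX GRU operator.
	“Not support yet!”	The operator is not supported.
	“Exception: ONNX Upsample only support height and width resize:”	An operator in the model exceeds the capability of the target chip; see the error message for details.
	“Exception: ONNX TOPK SGS only support the top-K sorted elements along a specified axis:/TopK”	An operator in the model exceeds the capability of the target chip; see the error message for details.
	“Exception: ONNX TOPK SGS only support the top-K largest elements along a specified axis:/TopK”	An operator in the model exceeds the capability of the target chip; see the error message for details.
	“Exception: ONNX Sum only support 2 inputs:”	An operator in the model exceeds the capability of the target chip; see the error message for details.
	“ValueError: Split_V outputs number exceeds the limit.”	An operator in the model exceeds the capability of the target chip; see the error message for details.
	“sub0 has wrong dynamic shape that cannot be figured out! please check if source shape is graph input”	An operator attribute error in the model file prevents the output tensor from being computed.
	“Exception: ONNX ScatterND reduction attrs didn't support 1 attribute yet!”	Parameter support limitation of the ONNX ScatterND operator.
	“Exception: ONNX ScatterElements include unsopport reduction type:/ScatterElements”	Parameter support limitation of the ONNX ScatterElements operator.
	“Exception: ONNX RESIZE only support 'nearest_mode' : round_prefer_floor/floor/round_prefer_ceil:”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: ONNX RESIZE only support nearest or linear:”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: resize do not support keep_aspect_ratio_policy”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: ONNX RESIZE unsupport 'exclude_outside' non-zero value:”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: ONNX RESIZE only support 'coordinate_transformation_mode' : align_corners/asymmetric/half_pixel/pytorch_half_pixel:”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: resize antialias only support 0”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: ONNX RESIZE scale tensor only support height and width resize:”	Parameter support limitation of the ONNX RESIZE operator.
	“Exception: ONNX RESHAPE unsupport allowedzero set 1:”	Parameter support limitation of the ONNX RESHAPE operator.
	“DYNSHAPESYM sub lack shape info, please check DYNAMIC_CONFIG in ini!”	An operator attribute error in the model file prevents the output tensor from being computed.
	“Exception: do not support noop_with_empty_axes not 0:”	Parameter support limitation of the ONNX ReduceSum/ReduceMin/ReduceMax/ReduceMean/ReduceL2 operators.
	“Exception: not support onnx op type OneHot yet”	The model file contains an unsupported operator.
	“Exception: ONNX DepthToSpace c must be divisible by blocksize^2”	Parameter support limitation of the ONNX DepthToSpace operator.
	“Exception: ONNX CUMSUM reverse only support is 0:”	Parameter support limitation of the ONNX CUMSUM operator.
	“Exception: ONNX CUMSUM exclusive only support is 0:”	Parameter support limitation of the ONNX CUMSUM operator.
	“Exception: ONNX ConvTranspose3d filter node /convtranspose3d/ConvTranspose not support kernel size 3 dilation 1 padding 4 4”	Parameter support limitation of the ONNX ConvTranspose operator.
	“Exception: ONNX ConvTranspose not support attr output_shape:”	Parameter support limitation of the ONNX ConvTranspose operator.
	“ValueError: Conv3d strideW strideH can not be larger than 31!!!”	The parameters of the ONNX ConvTranspose operator exceed the capability of the target chip.
	“Exception: Not support group conv3d yet:/conv3d/Conv”	Parameter support limitation of the ONNX Conv operator.
	“Assertion `s32InputCount <= ((10000))' failed.”	The parameters of the ONNX CONCAT operator exceed the capability of the target chip.
	“Exception: ONNX BatchNormalization not support training_mode not 0:”	Parameter support limitation of the ONNX BatchNormalization operator.
	“BatchNormalization not support spatial not 1:”	Parameter support limitation of the ONNX BatchNormalization operator.
	“BatchNormalization not support is_test not 0:”	Parameter support limitation of the ONNX BatchNormalization operator.
	“Exception: ONNX AveragePool node not support kernel size 180 180 stride 300 300”	The parameters of the ONNX AveragePool operator exceed the capability of the target chip.
	“Exception: ONNX AveragePool no support padding over 255”	The parameters of the ONNX AveragePool operator exceed the capability of the target chip.
	“Exception: ONNX AveragePool node not support kernel size 300 300 stride 1 1”	The parameters of the ONNX AveragePool operator exceed the capability of the target chip.
	“Exception: ONNX AVGPOOL only support dilation 1:”	The parameters of the ONNX AveragePool operator exceed the capability of the target chip.
	“Exception: ONNX ARGMAX select_last_index only support 0:”	Parameter support limitation of the ONNX ArgMax operator.
	“Exception: ONNX ARGMIN select_last_index only support 0:”	Parameter support limitation of the ONNX ArgMin operator.
	“ValueError: Not support soc_version: xxx”	xxx is not in the list of supported chips.
	“Input graph file {} does not exist!”	The model file does not exist.
	“google.protobuf.message.DecodeError: Protobuf decoding consumed too few bytes: 1 out of 736”	When running ConvertTool, the original model file does not match the configured model type.
	“`input_formats` is RGB only support `training_input_formats` is RGB!”	training_input_formats and input_formats in the ini configuration file do not match.
	“RuntimeError: The model doesn't have input named \"images1\"”	The inputs/outputs in the ini configuration file do not match those of the model being converted.
	“configparser.MissingSectionHeaderError: File contains no section headers.”	The ini configuration file contains a syntax error (content appears before any section header).
	“sub 0, tensor [output] has dynamic shape that cannot be figured out!”	An unknown input prevents shape inference from completing.
	“Assert!!! Errors happened in At most one dimension of the new shape can be -1”	The new shape of a reshape operator contains more than one -1 dimension.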
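
Several rows in the table above are caused by the ini configuration file rather than by the model. As a minimal sketch (not part of the toolchain), the file content can be parsed with Python's standard `configparser` module before ConvertTool is invoked, which reproduces the `MissingSectionHeaderError` row locally. The helper name `check_ini_text` and the section name `[INPUT_CONFIG]` in the usage lines are assumptions made for illustration only.

```python
import configparser


def check_ini_text(text):
    """Return None if `text` parses as an ini file, else a short description.

    Hypothetical helper: it only checks ini syntax, not whether the keys
    are the ones ConvertTool expects.
    """
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.MissingSectionHeaderError as exc:
        # Raised when a key/value line appears before any [section] header.
        return f"missing section header at line {exc.lineno}"
    except configparser.Error as exc:
        # Any other ini problem (duplicate keys, malformed lines, ...).
        return f"ini error: {exc}"
    return None


# A key written before any section header triggers the error from the table.
print(check_ini_text("inputs=images;\n"))
# The same key under a section header parses cleanly (section name assumed).
print(check_ini_text("[INPUT_CONFIG]\ninputs=images;\n"))
```

If the helper reports a missing section header, the reported line number shows where the first key appears outside a section.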

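Fixed-point model conversion	Error message	Cause analysis
	“FileNotFoundError: No images found in”	When running calibrator, the files listed in the txt do not match the training-set format, the listed files do not exist, or the txt file itself is malformed.
	“void SGS_CheckPoolFilterSize(OperationType, SGS_S32, SGS_S32, SGS_S32, SGS_S32): Assertion `0' failed.”	The kernel parameters of MAXPOOL exceed the capability of the target chip.
	“ERROR: no schedule info. weight name:/Constant_output_0”	The Conv parameters exceed the hardware spec.
	“ValueError: Not support soc_version: xxx”	The chip model soc_version xxx is not in the list of supported chips.
	“ValueError: Not recognized model.”	When running calibrator, the model file format is wrong.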
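
For the `No images found in` row, one quick check is whether every entry in the image list file exists on disk. The sketch below assumes the list file holds one image path per line and that relative paths are resolved against the current working directory; both are assumptions, and `missing_images` is a hypothetical helper, not a toolchain function.

```python
import os


def missing_images(list_path):
    """Return the entries of an image-list file that are not existing files.

    Assumes one path per line; blank lines are ignored.
    """
    missing = []
    with open(list_path, "r", encoding="utf-8") as f:
        for line in f:
            entry = line.strip()
            if not entry:
                continue
            # Relative entries are checked against the current directory.
            if not os.path.isfile(entry):
                missing.append(entry)
    return missing


# Usage (the file name is a placeholder):
#   for path in missing_images("image_list.txt"):
#       print("not found:", path)
```

An empty result means every listed file exists; the remaining causes in that row (file format mismatch, malformed txt) then need to be checked separately.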

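Offline model conversion	Error message	Cause analysis
	“FileNotFoundError: No such model:”	When running Compiler, the model file does not exist.
	“ValueError: Not support soc_version: xxx”	When running Compiler, soc_version xxx is not in the list of supported chips.
	“ValueError: Not recognized model”	When running Compiler, the model format is wrong.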

Precision debugging stage	Error message	Cause analysis
	“Assert!!! Errors happened in Image Header output 0 ExtFlag check failed”	The offline file contains an output ExtFlag error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 ExtFlag check failed”	The offline file contains an input ExtFlag error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Height alignment check failed”	The offline file contains an input Height alignment error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Width alignment check failed”	The offline file contains an input Width alignment error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 Aligned Buffer Size check failed”	The offline file contains an output Aligned Buffer Size error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Aligned Buffer Size check failed”	The offline file contains an input Aligned Buffer Size error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 Scalar check failed”	The offline file contains an output Scalar error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Scalar check failed”	The offline file contains an input Scalar error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 ZPoint check failed”	The offline file contains an output ZPoint error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 ZPoint check failed”	The offline file contains an input ZPoint error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 Shape check failed”	The offline file contains an output Shape error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Shape check failed”	The offline file contains an input Shape error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 Element format check failed”	The offline file contains an output Element format error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 Element format check failed”	The offline file contains an input Element format error; reconvert the model.
	“Assert!!! Errors happened in Image Header output 0 dim check failed”	The offline file contains an output tensor dimension error; reconvert the model.
	“Assert!!! Errors happened in Image Header Input 0 dim check failed”	The offline file contains an input dimension error; reconvert the model.
	“Assert!!! Errors happened in Image Header Batch mode check failed”	The offline file contains a Batch mode error; reconvert the model.
	“FileNotFoundError:”	When running simulator, the path to the model b.sim is wrong.
	“Not recognized model”	When running simulator, the model file a.bin has the wrong format.
	“ValueError: Got different num of preprocess_methods and images!”	When running simulator, the number of preprocessing scripts differs from the number of model inputs.
	“ValueError: Cannot set input tensor: Got tensor of type Unknown TensorType but expected type FLOAT32 for input 0, name: images”	The preprocessing file produces output in the wrong format.
	“ValueError: Cannot set tensor: Dimension mismatch. Got 160 but expected 640 for dimension 1 of input 0, name: images”	The preprocessing file contains a parameter error (the produced tensor size does not match the model input).
	“Assert!!! Errors happened in Image Header BufferSize1 Check Failed!”	The offline file contains a BufferSize1 error.
	“Assert!!! Errors happened in Image Header BufferSize Check Failed!”	The offline file contains a BufferSize error.
	“TypeError: the 'package' argument is required to perform a relative import for”	When running simulator, the preprocessing file is of the wrong type.
	“im, ratio, (dw, dh) = letterbox(im) SyntaxError: invalid syntax”	When running simulator, the preprocessing file contains a syntax error.
	“struct.error: unpack requires a buffer of 3136 bytes”	When debugging with the Analysis Tool, the bin file is incomplete.
	“ValueError: max() arg is an empty sequence”	The input file given to the auto_dump_debug.sh script has the wrong format.
	“KeyError: '100'”	Raised when running the auto_dump_debug.sh script; 100 is the erroneous tensor name in sigma_outtensor_dump_fixed.bin and sigma_outtensor_dump_float.bin.
	“UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte”	When running simulator, the txt file has the wrong format.
	“ValueError: Not support soc_version: xxx”	When running simulator, soc_version xxx is not in the list of supported chips.
	“ValueError: Cannot set tensor: Dimension mismatch. Got 320 but expected 640 for dimension 1 of input 0, name: images”	When running simulator, the preprocessing file contains a parameter error (the produced tensor size does not match the model input).
	“def image_preprocess(image_file, norm=True)SyntaxError: invalid syntax”	When running simulator, the preprocessing file contains a syntax error.
	“Assert!!! Errors happened in Concat output shape is invalid!”	The offline file contains an operator shape error.
	“Assert!!! Errors happened in Image Header Batch mode check failed”	The offline file contains a Batch mode error.
	“RuntimeError: MI_IPU_CreateCHNWithUserMem failed: 54”	When running the RPC simulator, the current chip model does not match the soc_version used during model conversion.
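
The `UnicodeDecodeError ... byte 0xff in position 0` row means that a file read as UTF-8 text begins with the byte 0xFF. At the byte level, two common explanations are a text file saved as UTF-16 with a byte order mark (FF FE) and a JPEG image (FF D8 FF) passed where a text list was expected. The sketch below inspects the leading bytes to tell these apart; `sniff_leading_bytes` is a hypothetical helper and the returned labels are illustrative.

```python
def sniff_leading_bytes(data: bytes) -> str:
    """Classify a file's content by its first bytes (illustrative only)."""
    if data.startswith(b"\xff\xd8\xff"):
        # JPEG files begin with the SOI marker FF D8 followed by FF.
        return "jpeg image"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        # UTF-16 byte order mark, little- or big-endian.
        return "utf-16 (BOM)"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8 text"


# A list file saved as UTF-16 starts with a BOM and cannot be read as UTF-8.
print(sniff_leading_bytes("000775.jpg\n".encode("utf-16")))
# The same content saved as UTF-8 is accepted.
print(sniff_leading_bytes("000775.jpg\n".encode("utf-8")))
```

To apply it to a file on disk, read the first few bytes in binary mode, for example `open(path, "rb").read(16)`, and pass them to the helper. Re-saving the list file as UTF-8 without a BOM addresses the first case.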