7. IPU Toolchain Operator Support
1. Caffe Supported Operators
| Operator | Notes |
|---|---|
| ArgMax | Only supports top 1 |
| Axpy | |
| BatchNorm | |
| Concat | Supports at most 1024 concatenated tensors |
| Convolution | Constraints: all tensor sizes < 2^31. If the kernel size is do*h*w*di, then h * w < 64 and round(di/16)*round(do/16) < 512 * 1024. When group == C: converted to Depthwise Convolution; when group == 1: handled as ordinary Convolution; when group is in (1, C): decomposed into GroupConv. |
| ConvolutionDepthwise | Natively supports kernel_size 3*3, 6*6, and 9*9; other cases are converted to Convolution. Constraint: pad range [0, 1] |
| CReLU | Input must be at most 4-D |
| ContinuationIndicator | |
| Crop | |
| Deconvolution | All tensor sizes < 2^31. If the kernel size is do*h*w*di, then h * w < 64 and round(di/16)*round(do/16) < 512 * 1024 |
| Dropout | |
| Eltwise | For PROD and SUM with two input tensors: when 4-D, the shape pair must be one of (1) NCHW with const, (2) NCHW with a C-dimension vector, (3) NCHW with NCHW; when 5-D, one of (1) NCDHW with const, (2) NCDHW with NCDHW |
| Flatten | |
| InnerProduct | If the weight size is do*di, then round(di/16)*round(do/16) < 512 * 1024 |
| Permute | |
| Pooling | If the kernel size is h*w: 1. AvePooling: (1) FilterW <= 255 and FilterH <= 255; (2) AvePooling_U8: FilterW*FilterH <= 12288; (3) AvePooling_S16: FilterW*FilterH <= 12288. 2. MaxPooling: requires FilterW <= 255 and FilterW*FilterH <= 6029312 |
| PriorBox | |
| Power | Only supports positive-integer exponents |
| Reshape | |
| Reverse | |
| ROIPooling | The rois input of ROIPooling has shape (N×5). N may be greater than 1 only when the rest of the network after ROIPooling consists solely of InnerProduct layers; if the later network contains any convolution, N must be 1 and the second sub-network has to be executed N times in a loop. See the Please Note section below for usage and restrictions. |
| ReLU | Input must be at most 4-D |
| PReLU | Input must be at most 4-D |
| Sigmoid | |
| Slice | |
| Scale | The shapes of the two input tensors must satisfy one of the following. When 4-D (NCHW): (1) NCHW with const, (2) NCHW with a C-dimension vector, (3) NCHW with NCHW. When 5-D: (1) NCDHW with const, (2) NCDHW with NCDHW |
| Softmax | To apply Softmax along a specific dimension, transpose that dimension to the last (innermost) dimension; at most 32*512 = 16384 elements are supported |
| Split | |
| ShuffleChannel | |
| Tanh | Input must be at most 4-D |
| Threshold | Only supports 4-D input |
| Tile | |
| Upsample | Caffe has no native Upsample operator; a Deconvolution can be manually rewritten as Upsample. Only supports 4-D input and only the same scale on H and W |
| Reorg | Only supports stride = 2 |
| LSTM | Supports unidirectional and bidirectional |
Please Note:

- The Upsample operator is written in the prototxt as follows. The scale parameter has the same meaning as the stride of Deconvolution, but note that Upsample is equivalent to a Deconvolution whose weights are all 1.

```
layer {
  bottom: "layer85-conv"
  top: "layer86-upsample"
  name: "layer86-upsample"
  type: "Upsample"
  upsample_param {
    scale: 2
  }
}
```

- The ROIPooling operator is written in the prototxt as follows. roi_pooling_param only supports pooled_w, pooled_h, and spatial_scale. For a Float model, the rois input is the coordinates output by the rpn layer; for Fixed and Offline models, the rois input is those coordinates multiplied by spatial_scale and then quantized to int16 before being fed into the model.

```
layer {
  name: "roi_pool5"
  type: "ROIPooling"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625
  }
}
```
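To make the Upsample/Deconvolution equivalence above concrete, here is a minimal NumPy sketch (illustrative only; `nearest_upsample` and `ones_weight_deconv` are made-up helper names, not toolchain APIs) showing that a stride-2 transposed convolution with a 2x2 all-ones kernel reproduces nearest-neighbor upsampling on a single channel:

```python
import numpy as np

def nearest_upsample(x, scale=2):
    # Nearest-neighbor upsample of a 2-D feature map.
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def ones_weight_deconv(x, scale=2):
    # Transposed convolution with stride=scale and a scale x scale
    # all-ones kernel: each input pixel is scattered into its own
    # scale x scale output block, with no overlap between blocks.
    h, w = x.shape
    out = np.zeros((h * scale, w * scale), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i*scale:(i+1)*scale, j*scale:(j+1)*scale] += x[i, j]
    return out

x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert np.array_equal(nearest_upsample(x), ones_weight_deconv(x))
```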
2. TensorFlow Supported Operators
| Operator | Notes |
|---|---|
| Convolution | Constraint: kernel_size must satisfy H * W < 64 |
| DepthwiseConv2dNative | Natively supports kernel_size 3*3, 6*6, and 9*9; other cases are converted to Convolution |
| FullyConnected | |
| Max pooling | |
| Average Pooling | |
| ReLU | |
| PReLU | |
| ReLU6 | |
| LeakyReLU | |
| Sigmoid | |
| Less | |
| Log | |
| Greater | |
| GreaterEqual | |
| Equal | |
| Add | |
| Sub | |
| Mul | |
| RealDiv | Only supports a constant tensor as the second operand |
| FloorDiv | Only supports a constant tensor as the second operand |
| Maximum | |
| Minimum | |
| Mean | |
| Max | |
| Sqrt | |
| Sin | |
| Cos | |
| Rsqrt | |
| Round | |
| Softmax | To apply Softmax along a specific dimension, transpose that dimension to the last (innermost) dimension |
| FusedBatchNorm | |
| Exp | |
| Align | |
| ConcatV2 | |
| Fill | |
| Gather | Only supports a constant tensor for the second operand (indices) |
| GatherV2 | |
| Pack | |
| Pad | |
| SpaceToBatchND | |
| BatchToSpaceND | |
| Zeroslike | |
| Split | |
| Slice | |
| Unpack | |
| Tile | |
| Reshape | |
| Transpose | |
| Resize_bilinear | |
| Resize_NearestNeighbor | |
| Batch_matmul | |
| TopKV2 | |
| Tanh | |
| Concatenation | |
| Argmax | |
| Logistic | |
| TransposeConv | |
| Square | |
| StrideSlice | |
| Abs | |
| Sum | |
| Cast |
3. ONNX Supported Operators
| Operator | Notes |
|---|---|
| Abs | Unlimited |
| Add | Unlimited |
| And | Unlimited |
| ArgMax | axis: Unlimited<br>keepdims: Unlimited<br>select_last_index: Only supports 0 |
| ArgMin | axis: Unlimited<br>keepdims: Unlimited<br>select_last_index: Only supports 0 |
| Atan | Unlimited |
| AveragePool | auto_pad: Supported (SAME_UPPER, SAME_LOWER, or VALID)<br>ceil_mode: Supported<br>kernel_shape: If the kernel shape is h*w, it must satisfy FilterW <= 255 and FilterH <= 255<br>pads: Supports two dimensions, both [0, 255]<br>strides: Supports two dimensions, both [0, 255]<br>count_include_pad: Unlimited<br>dilation: Only supports 1<br>If KernelW == InputW and pad_W == 0, the [0, 255] restrictions on KernelW and StrideW do not apply; likewise, if KernelH == InputH and pad_H == 0, KernelH and StrideH are also unrestricted |
| BatchNormalization | epsilon: Unlimited<br>momentum: Not supported<br>training_mode: Only supports 0<br>is_test: Only supports 0<br>spatial: Only supports 1 |
| Cast | to: Supports float32/float64/int64/int32/bool saturate: Not supported |
| Ceil | Unlimited |
| Clip | Unlimited |
| Concat | Supports at most 10000 concatenated tensors |
| Constant | Unlimited |
| ConstantOfShape | Unlimited |
| Conv | conv1d / conv2d: All tensor sizes < 2^31<br>auto_pad: Supported (SAME_UPPER, SAME_LOWER, or VALID)<br>dilations: Supported; two dimensions, both [0, 31]; larger values are optimized internally<br>group: Must be <= C<br>kernel_shape: Supported; h * w < 100<br>pads: Supported; four dimensions, all [0, 15]; larger values generate a separate Pad operator<br>strides: Supported; two dimensions, both [0, 31]<br>conv3d: All tensor sizes < 2^31<br>auto_pad: Supported (SAME_UPPER, SAME_LOWER, or VALID)<br>dilations: Supported; two dimensions, both [0, 31]; larger values are optimized internally<br>group: Not supported<br>kernel_shape: Supported; h * w < 100; d dimension unlimited<br>pads: Supported; h, w, d dimensions in six directions, all [0, 15]; larger values generate a separate Pad operator<br>strides: h and w supported, both [0, 31]; d dimension unlimited |
| ConvTranspose | All tensor sizes < 2^31<br>auto_pad: Supported (SAME_UPPER, SAME_LOWER, or VALID)<br>dilations: Supported; two dimensions, both [0, 31]<br>group: Must be <= C<br>kernel_shape: Supported; h * w < 100<br>output_padding: Supported<br>output_shape: Not supported<br>pads: Supported<br>strides: Supported; two dimensions |
| Cos | Unlimited |
| CumSum | exclusive: Only supports setting to 0 reverse: Only supports setting to 0 |
| DepthToSpace | blocksize: Supported; the input C dimension must be divisible by blocksize^2<br>mode: DCR or CRD |
| Div | Unlimited |
| Dropout | is_test: Not supported ratio: Not supported seed: Not supported |
| Einsum | equation: Unlimited<br>Supports single and double operands; does not support expressions using ellipsis (omitted-dimension) notation |
| Elu | Unlimited |
| Equal | Unlimited |
| Erf | Unlimited |
| Exp | Unlimited |
| Expand | Unlimited |
| Flatten | Unlimited |
| Floor | Unlimited |
| GRU | activation_alpha: Not supported<br>activation_beta: Not supported<br>activations: Only supports sigmoid/tanh<br>clip: Not supported<br>direction: Supports forward and bidirectional<br>layout: Only supports 0<br>linear_before_reset: Supports 0 and 1<br>The sequence_lens input does not support variable tensors |
| Gather | axis: Unlimited<br>The indices input supports both const and variable tensors |
| GatherElements | axis: Unlimited<br>The indices input supports both const and variable tensors |
| GatherND | axis: Unlimited<br>The indices input supports both const and variable tensors |
| Gelu | approximate: Supports None and tanh |
| Gemm | alpha:Unlimited beta:Unlimited transA:Unlimited transB:Unlimited |
| GlobalAveragePool | Unlimited |
| GlobalMaxPool | Unlimited |
| Greater | Unlimited |
| GreaterOrEqual | Unlimited |
| HardSigmoid | alpha:Unlimited beta:Unlimited |
| HardSwish | Unlimited |
| Identity | This operator will be removed during the conversion process |
| InstanceNormalization | epsilon:Unlimited |
| LSTM | activation_alpha: Not supported<br>activation_beta: Not supported<br>activations: Only supports sigmoid/tanh<br>clip: Not supported<br>direction: Only supports forward and bidirectional<br>input_forget: Only supports 0<br>layout: Only supports 0<br>The sequence_lens input does not support variable tensors |
| LayerNormal | axis: Does not support 0 epsilon: Unlimited stash_type: Not supported |
| LayerNormalization | axis: Does not support 0 epsilon: Unlimited stash_type: Not supported |
| LeakyRelu | alpha:Unlimited |
| Less | Unlimited |
| LessOrEqual | Unlimited |
| Log | Unlimited |
| LogSoftmax | Unlimited |
| Logcompress | Unlimited |
| MatMul | Unlimited |
| Max | Unlimited |
| MaxPool | auto_pad: Supported (SAME_UPPER, SAME_LOWER, or VALID)<br>ceil_mode: Supported<br>kernel_shape: Supported<br>pads: Supported<br>strides: Supported<br>storage_order: 1 is not supported; computed as 0<br>If the kernel size is h*w, it must satisfy FilterW <= 255 and FilterH <= 255 |
| MeanVarianceNormalization | Input must be 4-D<br>axes: Unlimited |
| Min | Unlimited |
| Mod | fmod: Only supports 0; otherwise unlimited |
| Mul | Unlimited |
| Neg | Unlimited |
| Not | Unlimited |
| Or | Unlimited |
| PRelu | Unlimited |
| Pad | mode: Supports constant/reflect<br>value: Unlimited<br>Does not support axes provided as an input |
| Pow | Unlimited |
| RNN | activation_alpha: Not supported<br>activation_beta: Not supported<br>activations: Only supports tanh<br>clip: Not supported<br>direction: Only supports forward and bidirectional<br>layout: Only supports 0<br>The sequence_lens input only supports an empty tensor; variable tensors are not supported |
| Range | Unlimited |
| Reciprocal | Unlimited |
| ReduceL2 | axes: Unlimited keepdims: Unlimited noop_with_empty_axes: Only supports 0 |
| ReduceMax | axes: Unlimited keepdims: Unlimited noop_with_empty_axes: Only supports 0 |
| ReduceMean | axes: Unlimited keepdims: Unlimited noop_with_empty_axes: Only supports 0 |
| ReduceMin | axes: Unlimited keepdims: Unlimited noop_with_empty_axes: Only supports 0 |
| ReduceSum | axes: Unlimited keepdims: Unlimited noop_with_empty_axes: Only supports 0 |
| Relu | Unlimited |
| Reshape | allowzero: Only supports 0, does not support 0 values in the output shape |
| Resize | Only supports resizing the H and W dimensions<br>antialias: Only supports 0<br>coordinate_transformation_mode: Supports align_corners/asymmetric/half_pixel/pytorch_half_pixel<br>mode: Supports nearest or linear<br>cubic_coeff_a: Not supported<br>exclude_outside: Only supports 0<br>extrapolation_value: Not supported<br>keep_aspect_ratio_policy: Not supported<br>nearest_mode: Supports round_prefer_floor, floor, and round_prefer_ceil |
| Round | Unlimited |
| Scatter | Unlimited |
| ScatterElements | The indices input supports both const and variable tensors<br>axis: Unlimited<br>reduction: Only supports None |
| ScatterND | The indices input supports both const and variable tensors<br>reduction: Only supports None |
| Shape | end:Unlimited start:Unlimited |
| Sigmoid | Unlimited |
| Sign | Unlimited |
| Sin | Unlimited |
| Slice | axes:Unlimited ends:Unlimited starts:Unlimited |
| Softmax | Unlimited |
| Softplus | Unlimited |
| SpaceToDepth | blocksize:Unlimited |
| Split | axis: Unlimited<br>split: Supports at most 10000 outputs<br>num_outputs: Supported |
| Sqrt | Unlimited |
| Squeeze | axes:Unlimited |
| Sub | Unlimited |
| Sum | Only supports two inputs |
| Tanh | Unlimited |
| Tile | Unlimited |
| TopK | axis: Unlimited largest: Only supports 1 sorted: Only supports 1 k: Unlimited |
| Transpose | perm:Unlimited |
| Unsqueeze | Unlimited |
| Upsample | Only supports upsampling in the H and W dimensions, and the scales must be equal<br>mode: Supports nearest and linear<br>height_scale: Unlimited<br>width_scale: Unlimited<br>scales: Unlimited |
| Where | Unlimited |
4. SGS_CHALK Supported Operators
For detailed usage of each SGS_CHALK operator, see the sgs_chalk module API.
| Operator | Notes |
|---|---|
| Abs | |
| Add | |
| Alpha_Blending | |
| ArgMin | |
| Argmax | |
| Atan | |
| Atan2 | |
| AveragePool2d | |
| AveragePool3d | |
| BatchMatMul | |
| BatchToSpaceND | |
| BoxDecoder | |
| BoxDecoder2 | |
| Cast | |
| Ceil | |
| Clip | |
| Concatenation | |
| CondGreat | |
| CondLess | |
| Conv2d | |
| Conv3d | |
| Conv3dImageConcat | |
| Cos | |
| Cumsum | |
| CustomNotEqual | |
| CustomPow | |
| CustomizedMaxpool2d | |
| DepthWiseConv2d | |
| Dilation | |
| Div | |
| Elu | |
| Equal | |
| Erf | |
| Exp | |
| Expand_dims | |
| Fill | |
| Floor | |
| Fullyconnected | |
| GRU | |
| Gather | |
| GatherElements | |
| GatherND | |
| Gelu | |
| Greater | |
| GreaterEqual | |
| GroupConv2d | |
| HardSwish | |
| Input | |
| Instancenorm | |
| L2Norm | |
| LSTM | |
| LSTM_caffe | |
| Layernorm | |
| LeakyRelu | |
| Less | |
| LessEqual | |
| Log | |
| Logcompress | |
| LogicalAnd | |
| LogicalNot | |
| LogicalOr | |
| Logistic | |
| MaxPool2d | |
| MaxPool3d | |
| Maximum | |
| Mean | |
| MeanVarianceNorm | |
| Minimum | |
| MirrorPad | |
| Mod | |
| Mul | |
| MultiplyAdd | |
| Negative | |
| NotEqual | |
| Pack | |
| Pad | |
| PhaseModify | |
| PostProcess_Max | |
| PostProcess_Unpack | |
| Prelu | |
| RNN | |
| RSqrt | |
| Range | |
| Reciprocal | |
| ReduceMax | |
| ReduceMin | |
| Relu | |
| Relu1 | |
| Relu6 | |
| Relu_N1_TO_1 | |
| Reshape | |
| ResizeBilinear | |
| ResizeNearestNeighbor | |
| RoiPooling | |
| RootSumSquares2 | |
| Round | |
| ScatterElements | |
| ScatterND | |
| Score_Filter | |
| Select | |
| Shape | |
| Sign | |
| Silu | |
| Sin | |
| Slice | |
| Softmax | |
| Softplus | |
| SpaceToBatchND | |
| Split | |
| Split_V | |
| Sqrt | |
| Square | |
| Squeeze | |
| StridedSlice | |
| Sub | |
| Sum | |
| TFLite_Detection_NMS | |
| Tanh | |
| Tile | |
| TopK | |
| Transpose | |
| TransposeConv2d | |
| Unpack | |
| WiggleErr |
5. IPU Toolchain Restrictions on Models
For Softmax along a specified dimension, only the innermost dimension is supported (for a multi-dimensional tensor, the Softmax axis must be the innermost dimension); a workaround sketch follows.
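Where a model needs Softmax along a non-innermost axis, the usual workaround is to transpose that axis to the innermost position first. A minimal NumPy sketch of the idea (`softmax_along` is an illustrative helper, not a toolchain API):

```python
import numpy as np

def softmax_along(x, axis):
    # Move the target axis to the innermost position, apply a
    # numerically stable softmax there, then move it back.
    x = np.moveaxis(x, axis, -1)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    y = e / e.sum(axis=-1, keepdims=True)
    return np.moveaxis(y, -1, axis)

x = np.random.rand(1, 8, 4, 4).astype(np.float32)  # NCHW
y = softmax_along(x, axis=1)  # softmax over C via an innermost-dim softmax
```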
Except for the first Conv layer, a Conv layer is more efficient the larger its DI dimension (the C dimension in NHWC) is; at most 2048 is supported.
Math-class operators (element-wise operators such as Add, Sub, Mul, and Div) are more efficient when the right operand is a scalar (a single number) or a 1-D vector (identical across the H and W dimensions, varying only along C); see the shapes sketched below.
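For illustration only (the shapes and variable names here are ours, not toolchain requirements), the three right-operand cases for an NHWC tensor look like this in NumPy:

```python
import numpy as np

x = np.random.rand(1, 32, 32, 16).astype(np.float32)       # NHWC input

scalar = np.float32(0.5)                                    # a single number
per_c  = np.random.rand(16).astype(np.float32)              # 1-D vector along C
full   = np.random.rand(1, 32, 32, 16).astype(np.float32)   # full-size tensor

y0 = x * scalar  # scalar right operand: most efficient per the rule above
y1 = x * per_c   # same value across H/W, varies along C: also efficient
y2 = x * full    # full-tensor right operand: the most expensive case
```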
In the network structure, minimize cases where one operator's output is consumed as input by multiple operators, such as the residual blocks of ResNet or the Inception modules of GoogLeNet. See also Building a BW-friendly AI model below.
6. Model Performance Optimization Rules
(1) Convolution performance
A 3x3 kernel size is best, especially for the first layer.
With a 1x1 kernel size, it is best when the innermost-dimension shape of the input tensor is aligned to 16.
(2) DMA operators
The Concatenation operator performs better than Pack.
Split performs better than Slice.
Minimize transposes on the innermost dimension.
For element-wise operators, a const operand is best placed as the right operand, i.e. input[1] (see the sketch below).
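As a concrete sketch (built with the official onnx helper API; the graph and tensor names are illustrative), here is a Mul node with the constant deliberately wired as input[1]:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# Constant scale factor stored as an initializer.
scale = numpy_helper.from_array(np.array(0.5, dtype=np.float32), name="scale")

# Const operand wired as input[1] (the right operand), per the rule above.
mul = helper.make_node("Mul", inputs=["x", "scale"], outputs=["y"])

graph = helper.make_graph(
    [mul],
    "const_as_right_operand",
    inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 16, 32, 32])],
    outputs=[helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 16, 32, 32])],
    initializer=[scale],
)
model = helper.make_model(graph)
onnx.checker.check_model(model)
```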
(3) General
Tensors are best kept at 4 dimensions.
It is best when the innermost-dimension shape of a tensor is aligned to 32.
Softmax is best applied only to the innermost dimension.
For ReduceMax, ReduceMin, and ReduceSum, it is best when the reduced (collapsed) dimensions are adjacent.
(4) Building a BW-friendly AI model
A BW-friendly AI model is one whose data occupies as little system-bus BW (bandwidth) as possible. BW is computed as:
BW = Input BW + Output BW + Const BW + Variable BW
Model performance can be improved by reducing variable BW, which is driven by two factors (a rough illustration follows):
(1) the tensor data volume is too large;
(2) the tensor lifetime is too long.
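As a rough illustration of these two factors (a back-of-envelope model of our own, not the toolchain's actual BW accounting): if an intermediate tensor too large to stay on-chip must be written out once and read back by every later consumer, its variable-BW cost grows with both its size and how long it stays live.

```python
# Back-of-envelope sketch only; not the toolchain's real cost model.
# Assumption: a spilled intermediate tensor is written to DRAM once
# and read back once per consuming layer.
def variable_bw_bytes(tensor_bytes, num_consumers):
    return tensor_bytes * (1 + num_consumers)

feat = 1 * 56 * 56 * 64            # bytes of one INT8 NHWC feature map
print(variable_bw_bytes(feat, 1))  # consumed once, right away
print(variable_bw_bytes(feat, 2))  # e.g. a residual skip consumed again later
```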
Example 1: BW-friendly

Example 2: tensor lifetime too long

Example 3: tensor data volume too large

The figures above show three common model structures. The structure in Figure 1 follows the optimization rules, while the structures in Figures 2 and 3 degrade model performance. When building a model, use the structures of Figures 2 and 3 as sparingly as possible.