MI SYS Debug SOP


REVISION HISTORY

Revision No.    Description        Date
1.0             Initial release    12/21/2023

    Preface

    This document is written for FAE and software development personnel. It describes how to perform preliminary troubleshooting when customers encounter MI SYS-related issues during development, and which information to provide to RD for analysis once the issue is confirmed to be a Sigmastar SDK issue.


    1. MMU Memory Violation Issues

    1. The following log indicates that an MMU memory violation has occurred.

      [MI WRN] MI_SYS_Mma_MmuCallback[227]: [MI_SYS_Mma_MmuCallback] Status=0x2, PhyAddr=0x40070000, ClientId=0x70,Name=CPU_CA35 IsWrite=0
      
      • Status
        • The reason this callback was triggered. The current value is 0x2, which indicates an illegal read/write operation.
      • PhyAddr
        • The starting address of the entry that triggered the exception. The current value is 0x40070000.
      • ClientId
        • Module ID. The current value is 0x70.
      • Name
        • Module name. The current value is CPU_CA35.
      • IsWrite
        • Read/write operation: 1 means write, 0 means read. The current value is 0, indicating a read operation.
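
    For triaging captured logs on a host PC, the fields described above can be pulled out of the log line with a small script. This is a minimal sketch: the regular expression is modeled only on the sample line in this document, and the function name parse_mmu_callback is a hypothetical helper, not part of the SDK.

```python
import re

# Pattern modeled on the sample log line in this document; other SDK
# versions may print the fields differently (assumption).
MMU_LOG_RE = re.compile(
    r"Status=(?P<status>0x[0-9a-fA-F]+),\s*"
    r"PhyAddr=(?P<phy_addr>0x[0-9a-fA-F]+),\s*"
    r"ClientId=(?P<client_id>0x[0-9a-fA-F]+),\s*"
    r"Name=(?P<name>\S+)\s+"
    r"IsWrite=(?P<is_write>[01])"
)


def parse_mmu_callback(line):
    """Return the fields of one MmuCallback line, or None if it does not match."""
    m = MMU_LOG_RE.search(line)
    if m is None:
        return None
    return {
        "status": int(m.group("status"), 16),
        "phy_addr": int(m.group("phy_addr"), 16),
        "client_id": int(m.group("client_id"), 16),
        "name": m.group("name"),
        "is_write": m.group("is_write") == "1",
    }


sample = (
    "[MI WRN] MI_SYS_Mma_MmuCallback[227]: [MI_SYS_Mma_MmuCallback] "
    "Status=0x2, PhyAddr=0x40070000, ClientId=0x70,Name=CPU_CA35 IsWrite=0"
)
fields = parse_mmu_callback(sample)
print(fields)
```
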
    2. Analysis

      • Before running the application, enter the following command to collect detailed information about MMA memory allocation and release.

        echo debug_mmu debug_log 1 > /proc/mi_modules/mi_sys/mi_sys0
        

        If the offset of PhyAddr (relative to the vpa starting address) lies between the fr and ed values of a released buffer, the module accessed memory that had already been released.

        al:64000 sz:cac0 ed:70ac0 na:app-mmaAlloc
        fr:64000 sz:cac0 ed:70ac0  na:app-mmaAlloc
        [MI WRN] MI_SYS_Mma_MmuCallback[227]: [MI_SYS_Mma_MmuCallback] Status=0x2, PhyAddr=0x40070000, ClientId=0x70,Name=CPU_CA35 IsWrite=0
        

        From the log, PhyAddr = 0x40070000. The starting address of vpa can be obtained from /proc/mi_modules/mi_sys_mma/mma_heap_name0; here it is 0x40000000, so the offset of PhyAddr relative to the vpa starting address is 0x70000. This offset lies within the freed range (0x64000 to 0x70ac0), so it is highly likely that the CPU accessed the released app-mmaAlloc buffer.

      • If the PhyAddr printed by MmuCallback is not within any freed address range, the log volume was probably too high and some entries were overwritten. In that case, if the module that caused the memory violation has already been identified, you can restrict the memory allocation log to that module.

        echo debug_mmu debug_log [enable/disable] [enable/disable free]  [enable/disable alloc] [module id list] > /proc/mi_modules/mi_sys/mi_sys0
        

        Example:

        • Enable vif ai alloc/free buf log
          echo debug_mmu debug_log 1 1 1 4,6 > /proc/mi_modules/mi_sys/mi_sys0
          
      • If the above does not reveal any access to freed buffers, enable debugMmu for further debugging: modify the /config/modparam.json file and add {"debugMmu" : 1} under E_MI_MODULE_ID_SYS, then restart the machine and run the application.

        "E_MI_MODULE_ID_SYS" :
        {
        "cmdQBufSize" : 400,
        "debugmmu": 1
        },
        

        With this setting, HW over-reads and over-writes (accesses past the end of a buffer) are captured 100% of the time, and logs similar to the following are printed.

        al:1480000 sz:320000 ed:17a0000 na:vde_rec_0_0_0
        [MI WRN] MI_SYS_Mma_MmuCallback[425]: [MI_SYS_Mma_MmuCallback] Status=0x2, PhyAddr=0x401780000, ClientId=0x38,Name=VENC0_CODEC0_R  IsWrite=0
        

        First, calculate the entry size: after booting, cat /proc/mi_modules/mi_sys_mma/mma_heap_name0; the length in heap_info divided by freeEntryNum is the entry size. In this example the MMU entry size is 128K (0x20000). If the PhyAddr offset (relative to the vpa starting address) plus the entry size equals ed, the module has performed an out-of-bounds memory access. In the above log, 0x17a0000 (ed) = 0x1780000 + 0x20000 (entry size), indicating that the VENC0_CODEC0_R module has performed an out-of-bounds memory access.

      • Find the corresponding module owner based on the Name, provide the above printed log and inform them of the analysis conclusion.
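
    The two checks in the analysis above (access to a freed buffer, and an access that runs one entry past the end of a buffer) can be sketched as a host-side script. The helper names are hypothetical. The vpa starting address 0x400000000 used for the second sample is an assumption inferred from the offset 0x1780000 quoted in the text; on a real board, read it from mma_heap_name0. The freed-buffer check does not account for a range being reallocated after the free.

```python
import re

# Matches the "al:" / "fr:" lines printed by debug_mmu debug_log; the values
# are hexadecimal without a 0x prefix, as in the samples in this document.
ALLOC_RE = re.compile(
    r"\b(?P<kind>al|fr):(?P<start>[0-9a-fA-F]+)\s+sz:(?P<size>[0-9a-fA-F]+)\s+"
    r"ed:(?P<end>[0-9a-fA-F]+)\s+na:(?P<name>\S+)"
)


def parse_alloc_lines(log_text):
    """Collect (kind, start, end, name) tuples from alloc/free log lines."""
    records = []
    for m in ALLOC_RE.finditer(log_text):
        records.append((m.group("kind"), int(m.group("start"), 16),
                        int(m.group("end"), 16), m.group("name")))
    return records


def find_freed_hit(records, phy_addr, vpa_start):
    """Name of a freed buffer whose [start, end) contains the faulting offset."""
    offset = phy_addr - vpa_start
    for kind, start, end, name in records:
        if kind == "fr" and start <= offset < end:
            return name
    return None


def find_overrun_buffer(records, phy_addr, vpa_start, entry_size):
    """Name of an allocated buffer whose end equals offset + entry_size."""
    offset = phy_addr - vpa_start
    for kind, start, end, name in records:
        if kind == "al" and offset + entry_size == end:
            return name
    return None


freed_log = (
    "al:64000 sz:cac0 ed:70ac0 na:app-mmaAlloc\n"
    "fr:64000 sz:cac0 ed:70ac0  na:app-mmaAlloc\n"
)
freed_name = find_freed_hit(parse_alloc_lines(freed_log), 0x40070000, 0x40000000)

overrun_log = "al:1480000 sz:320000 ed:17a0000 na:vde_rec_0_0_0\n"
# 0x400000000 is an assumed vpa start; 0x20000 is the 128K entry size.
overrun_name = find_overrun_buffer(
    parse_alloc_lines(overrun_log), 0x401780000, 0x400000000, 0x20000)

print(freed_name, overrun_name)
```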

    2. MIU Protect Issues

    The following log indicates that an MIU protect has occurred:

    [HAL MIU ERR] [hal_miu_protect_show_hit_info@1301] [PROTECT] IP hit protect address.
    Hit block 32, address: 0x400800000<->0x400802000, whitelist: [CPU_CA55, ]
    Hit count: 1
    Hit IP : 0x45 - BDMA
    Hit MMU address: 0x400800000<->0x40080000f
    

    You can follow the steps below for analysis and localization:

    1. Identify the reason for the MIU protect: "IP hit protect address" indicates that an IP outside the whitelist accessed a protected address; "IP write out of dram" indicates that the IP accessed an address beyond the DRAM range or that an MMU memory violation occurred. To confirm the reason, disable the kernel panic and observe whether MMU memory violation logs appear. (Refer to step 4 for how to disable the kernel panic.)
    2. From the log information, it can be seen that the problematic IP is BDMA, which accessed the address 0x400800000<->0x40080000f in block 32 (the address range of this block is 0x400800000<->0x400802000, and the whitelist is set to allow only CPU_CA55 to access).
    3. Seek help from the BDMA IP owner, providing the log and the current usage of BDMA, parameter calls, and other information.
    4. By default, a kernel panic occurs after an MIU protect. If you do not want a panic when the problem occurs, set the "g_bEnableMiuPanic" field in modparam.json to false. After changing the setting, you need to restart or reload mi_sys.ko.
    5. The current MIU protect status of the system can be obtained with cat /proc/mi_modules/mi_sys_mma/miu_protect. The default policy protects all kernel lx_mem memory except MMA: BSP-related IPs are placed on the whitelist, and stream pipe-related IPs are not allowed access.
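
    When many MIU protect reports have to be compared, the fields read manually in steps 1 and 2 can be extracted with a script. The following is a minimal sketch whose patterns are modeled only on the sample log in this section; parse_miu_protect is a hypothetical helper name, and other SDK versions may format the report differently.

```python
import re

HEX = r"0x[0-9a-fA-F]+"


def parse_miu_protect(log_text):
    """Extract the fields of one MIU protect report; None if a part is missing."""
    reason = re.search(r"\[PROTECT\]\s*([^.\n]+)", log_text)
    block = re.search(
        rf"Hit block (\d+), address: ({HEX})<->({HEX}), whitelist: \[([^\]]*)\]",
        log_text)
    ip = re.search(rf"Hit IP\s*:\s*({HEX})\s*-\s*(\S+)", log_text)
    hit = re.search(rf"Hit MMU address:\s*({HEX})<->({HEX})", log_text)
    if not (reason and block and ip and hit):
        return None
    # The whitelist is printed as "[CPU_CA55, ]", so drop empty items.
    whitelist = [n.strip() for n in block.group(4).split(",") if n.strip()]
    return {
        "reason": reason.group(1).strip(),
        "block": int(block.group(1)),
        "protected_range": (int(block.group(2), 16), int(block.group(3), 16)),
        "whitelist": whitelist,
        "ip_id": int(ip.group(1), 16),
        "ip_name": ip.group(2),
        "hit_range": (int(hit.group(1), 16), int(hit.group(2), 16)),
        "ip_in_whitelist": ip.group(2) in whitelist,
    }


sample = """[HAL MIU ERR] [hal_miu_protect_show_hit_info@1301] [PROTECT] IP hit protect address.
Hit block 32, address: 0x400800000<->0x400802000, whitelist: [CPU_CA55, ]
Hit count: 1
Hit IP : 0x45 - BDMA
Hit MMU address: 0x400800000<->0x40080000f
"""
info = parse_miu_protect(sample)
print(info["ip_name"], info["whitelist"], info["ip_in_whitelist"])
```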

    3. Insufficient Frame Rate Issues

    The insufficient frame rate issue is specifically manifested as:

    The player pulling the stream does not reach the set fps value.

    If an insufficient frame rate issue occurs, you can follow the steps below for analysis and localization:

    1. watch -n 1 'cat /proc/mi_modules/mi_venc/mi_venc*'

      ---------------------------- OutputPort of dev: 0 -----------------------------
      ChnId  CODEC  Profile  BufSize  MinAllocSize  RefNum  bByFrame  FrameCnt  DropCnt  ReEncCnt  RingUnreadCnt  RingTotalCnt  UsrLockedCnt
          0   H265        0  4147200          7273       0         1    871390        0         0              0             0             0

      -----------------------------------Input port common info for mi_venc only dump enabled port-----------------------------------
      ChnId  PassId  PortId  user_buf_quota  UsrInjectQ_cnt  BindInQ_cnt  TotalPendingBuf_size  usrLockedInjectCnt
          0       0       0               4               0            4              37601280                   0
      ChnId  PassId  PortId  curWorkingQ_cnt  workingTask_cnt  lazzyRewindTask_cnt
          0       0       0                3                1                    0

    Check whether RingUnreadCnt/RingTotalCnt/UsrInjectQ_cnt/BindInQ_cnt varies between 2 and 4. If so, it indicates that the APP is not pulling the stream fast enough. To confirm that the APP is indeed not pulling the stream fast enough, you can directly drop the encoded ES inside mi venc.

    echo drop_out a d > /proc/mi_modules/mi_venc/mi_venc0
    

    At this point, observe the Fps_1s field in the mi venc procfs to see if the frame rate has improved.

    ChnId  State  EnPred  base  enhance MaxStreamCnt  Fps_1s    kbps1s  Fps10s    kbps10s  lastQp
        0      0       0     0        0            3   29.99       964   29.98        929      20
    
    2. Working from the downstream module to the upstream module, cat /proc/mi_modules/mi_xxx/mi_xxx* for each module in turn.

    Check whether the DropCnt of the InputPort continues to increase. A continuous increase indicates that the corresponding module is dropping frames internally.

    ChnId  State  EnPred  base  enhance MaxStreamCnt  Fps_1s    kbps1s  Fps10s    kbps10s  lastQp
        0      0       0     0        0            3   12.99       964   13.58        929      20
    

    Continue to check whether GetIntoMaxCnt/GetIntoMmaLackCnt/GetIntoTotalCnt/GetIntoFrcCnt are continuously increasing.

      ChnId  PassId  PortId    GetFrame/Ms    FPS   FinishCnt   RewindCnt  GetTotalCnt    GetOkCnt    GetIntoMaxCnt  GetIntoMmaLackCnt   GetIntoTotalCnt  GetIntoFrcCnt
          0       0       0        13/1071  12.13      982123          10      1903946      982133                0                  0            921813              0
    

    -- An increase in GetIntoMaxCnt indicates that MaxEnqTasks is full; you can try increasing it. If effective, you need to communicate with the corresponding MI module owner to modify MaxEnqTasks in the driver.

    echo set_Max_EnqTasks <eModuleId> <u8DevId> <u8PassId> <u16ChannelId> <EnqTasksNum> > /proc/mi_modules/mi_sys/mi_sys0
    
    echo set_Max_EnqTasks 23 0 1 0 5 > /proc/mi_modules/mi_sys/mi_sys0
    

    -- An increase in GetIntoMmaLackCnt indicates that MMA is insufficient; you can try increasing it via bootargs. If effective, modify the value of CONFIG_MMA_MEM_SIZE in defconfig to the adjusted size.

    -- An increase in GetIntoTotalCnt indicates that the output port's buffer may be insufficient; you can try increasing it. If effective, modify the APP code to change the MI module output port depth. If ineffective, it may be that the downstream module is processing too slowly, causing buffer accumulation in the downstream module. In this case, you need to analyze the reason for the slow processing of the downstream module.

    echo set_ouputport_depth [Modid] [Devid] [Chnid] [Passid] [Portid] u32UserFrameDepth u32BufQueueDepth > /proc/mi_modules/mi_sys/mi_sys0
    
    echo set_ouputport_depth 23 0 0 1 0 0 6 > /proc/mi_modules/mi_sys/mi_sys0
    

    -- An increase in GetIntoFrcCnt indicates that frames are being dropped by frame rate control, which generally means that frame rate control is configured and taking effect.
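
    Deciding which of the four counters is increasing requires comparing two reads of the procfs table. The comparison can be sketched as follows, assuming the header row and value row have been copied from the procfs output. The first snapshot is the row shown above; the second snapshot is hypothetical data made up for illustration, and the helper names are not part of the SDK.

```python
# Hints summarizing the explanations in this section.
HINTS = {
    "GetIntoMaxCnt": "MaxEnqTasks is full; try raising it with set_Max_EnqTasks",
    "GetIntoMmaLackCnt": "MMA is insufficient; try a larger MMA size in bootargs",
    "GetIntoTotalCnt": "output port buffers may be insufficient, or downstream is slow",
    "GetIntoFrcCnt": "frames are being dropped by frame rate control",
}


def parse_proc_row(header_line, value_line):
    """Pair whitespace-separated column names with the values on the next line."""
    names = header_line.split()
    values = value_line.split()
    if len(names) != len(values):
        raise ValueError("column count mismatch")
    return dict(zip(names, values))


def growing_counters(before, after):
    """Return the names of the GetInto* counters that increased between two reads."""
    return [name for name in HINTS if int(after[name]) > int(before[name])]


header = ("ChnId PassId PortId GetFrame/Ms FPS FinishCnt RewindCnt GetTotalCnt "
          "GetOkCnt GetIntoMaxCnt GetIntoMmaLackCnt GetIntoTotalCnt GetIntoFrcCnt")
first = parse_proc_row(
    header, "0 0 0 13/1071 12.13 982123 10 1903946 982133 0 0 921813 0")
# Hypothetical second read taken a few seconds later.
second = parse_proc_row(
    header, "0 0 0 13/1071 12.13 982162 10 1904024 982172 0 0 921852 0")

for name in growing_counters(first, second):
    print(name, "->", HINTS[name])
```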

    3. Based on step 2, identify which module is dropping frames and for what reason. Then work with the module owner to analyze whether the module HW time consumption is normal, whether BW is normal, and whether the pipe exceeds specifications.

    4. To check BW, enter the sdk/verify/mi_demo directory and make source/miu_server.

    5. Copy sdk/verify/mi_demo/out/arm/app/prog_miu_server to the board.
    6. Run ./prog_miu_server on the board.
    7. Use the BWLA Tool to connect to the board and check the BW data.

    4. Insufficient MMA Memory Issues

    The insufficient MMA memory issue is specifically manifested as:

    1. The application or module returns failure when calling memory allocation interfaces such as MI_SYS_MMA_Alloc.
    2. The module experiences frame drops, and the "GetIntoMmaLackCnt" field is found to be increasing through the corresponding module's proc info.
    3. Others.

    Run cat /proc/mi_modules/mi_sys_mma/mma_heap_name0 to check the current MMA usage of the system, focusing on the "length", "avail", "used", and "HighPeak" fields. If avail is less than the size to be allocated, or length - HighPeak is less than the size to be allocated, a memory shortage is likely; otherwise, memory fragmentation is likely.
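
    The rule above can be written as a small decision function. This is a sketch only: the function name is hypothetical, the arguments are assumed to be byte counts read manually from mma_heap_name0, and the numbers in the example are made up.

```python
def classify_mma_failure(length, avail, high_peak, request):
    """Apply the rule from the text to one failed allocation.

    Returns "shortage" if avail or (length - HighPeak) is smaller than the
    requested size, otherwise "fragmentation".
    """
    if avail < request or (length - high_peak) < request:
        return "shortage"
    return "fragmentation"


# Hypothetical values in bytes, not taken from a real board.
print(classify_mma_failure(length=0x8000000, avail=0x100000,
                           high_peak=0x7F00000, request=0x200000))
print(classify_mma_failure(length=0x8000000, avail=0x2000000,
                           high_peak=0x5000000, request=0x200000))
```

    The result is only a first classification; the heap dump described in the steps below is still needed to confirm it.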

    If a memory shortage issue occurs, you can follow the steps below for analysis and localization:

    1. Increase MMA memory:

      1. If it is a pure Linux system, modify the environment variable bootargs field to increase the size of mma_heap_name0.
      2. If it is a dual OS system, you can increase mma_size by modifying the environment variable bootargs and correspondingly reduce the sizes of LX_MEM and mma_base. LX_MEM requires 1M alignment. The mma_base and mma_size in bootargs_rtos must be consistent with those in bootargs.
    2. Check the specific usage of memory in /proc/mi_modules/mi_sys_mma/mma_heap_name0, reviewing whether there are memory leaks (whether MMA buffers are continuously increasing, whether there are MMA buffers that have not been reclaimed after destroying the pipeline).

    3. /proc/mi_modules/mi_sys_mma/mma_heap_name0 cannot present dynamic information on memory allocation and release; you can use the following command for further confirmation. When MMA buffer allocation fails, it will print the current memory information of the MMA heap. You can further analyze whether it is due to insufficient memory, leakage, or fragmentation.

      echo MI_SYS_Chunk_DumpMgrPrintk > /proc/mi_modules/mi_debug/debug_func

      echo 7 > /proc/mi_modules/mi_debug/debug_level

    4. Adjust the pipeline, and try to use real-time, hwring, and other memory-saving binding methods between modules.

    5. Provide scenario diagrams, /proc/mi_modules/mi_sys_mma/mma_heap_name0 information, and kmsg prints to the mi_sys owner for analysis and conclusions.
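
    For the leak review in step 2, one simple approach is to record the "used" value after each pipeline create/destroy cycle and check whether it keeps growing. The following is a minimal sketch under that assumption; the function name, the three-sample threshold, and the readings are all hypothetical.

```python
def leak_suspected(used_after_each_cycle, min_cycles=3):
    """Return True if 'used' grew after every create/destroy cycle.

    used_after_each_cycle: 'used' values read from mma_heap_name0 after each
    pipeline destroy, oldest first.
    """
    if len(used_after_each_cycle) < min_cycles:
        return False  # too few samples to call it a trend
    pairs = zip(used_after_each_cycle, used_after_each_cycle[1:])
    return all(later > earlier for earlier, later in pairs)


# Hypothetical readings in bytes, not taken from a real board.
leaking = [0x1200000, 0x1280000, 0x1300000, 0x1380000]
stable = [0x1200000, 0x1200000, 0x1200000, 0x1200000]
print(leak_suspected(leaking))
print(leak_suspected(stable))
```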

    If a memory fragmentation issue occurs, you can follow the steps below for analysis and localization:

    1. Enable MMU by setting the "bEnableMmu" field in modparam.json to true; after setting, you need to restart or reload mi_sys.ko.
    2. Optimize the application, using a private pool to limit fragmentation issues to a certain extent. (For the use of private pools, please refer to the usage instructions of the MI_S32 MI_SYS_ConfigPrivateMMAPool() interface in the MI SYS API.)
    3. Increase MMA memory.

    5.1. CMDQ Timeout

    When CMDQ experiences a timeout, it is usually because a HW interrupt has not occurred or a certain register setting does not meet expectations. At this time, the following information will be output to the serial port:

    [CMDQ]cmdq(0) ERR: WAIT_TRIG_TIMEOUT (0x00000400)
    [CMDQ]Cmd data = 0x2000 : 0x0000 : 0x0000 : 0xFFFE
    [CMDQ]Cmd:WAIT, dbg:0, adr:000000, dat:0000, mask:fffe
    [CMDQ]Wait command timeout. Trigger_Bus Bit [0] Event [ISP_CMDQ_DONE]
    [CMDQ]dram(s, e, w, r)=(0x40000000, 0x40003ff0, 0x40000020, 0x40000020) sram(w,r)=(002,002)
    
    [CMDQ]cmdq(7) ERR: POLLEQ_TIMEOUT (0x00000200)
    [CMDQ]Cmd data = 0x3008 : 0x94F8 : 0xFF00 : 0x00FF
    [CMDQ]Cmd:POLLEQ, dbg:0, adr:1129f0, dat:ff00, mask:00ff
    [CMDQ]Poll eq command timeout.  Bank:0x1129; Offset:0x78; Value:0xff00; Mask:0x00ff
    [CMDQ]dram(s, e, w, r)=(0x4001c000, 0x4001fff0, 0x4001c030, 0x4001c030) sram(w,r)=(003,003)
    

    First, determine which module's CMDQ is having issues. There are two ways to do this:

    5.1.1. Wait Trigger Timeout

    When the timeout type is WAIT_TRIG_TIMEOUT, the log names the event that was being waited for. In the log above it is Event [ISP_CMDQ_DONE], indicating that it is an ISP issue.

    5.1.2. Other

    Additionally, the first line of each log contains the ID of the CMDQ that experienced the timeout (cmdq(0) and cmdq(7) in the logs above). You can check the CMDQ ID used by a given module with cat /proc/mi_modules/mi_xx/mi_xx, as follows:

    ----------------------CMDQ kickoff counter-----------------------
    DevId    current_buf_size  Peak_buf_size
        0                   0              0
    each dev buf info:
                offset              length           used_flag           task_name
    

    Compare with the problematic CMDQ ID to determine which module it is, and finally find the corresponding module's SW owner to analyze the log and running scenario.
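
    The CMDQ ID, timeout type, and awaited event can be collected from a serial log with a short script before comparing them with the module procfs output. This is a sketch whose patterns are modeled only on the two sample logs in this section; parse_cmdq_timeouts is a hypothetical helper name.

```python
import re

HEAD_RE = re.compile(
    r"\[CMDQ\]cmdq\((?P<id>\d+)\) ERR: (?P<err>\w+) \((?P<code>0x[0-9a-fA-F]+)\)")
EVENT_RE = re.compile(r"Event \[(?P<event>\w+)\]")


def parse_cmdq_timeouts(log_text):
    """Return one dict per timeout report: CMDQ ID, error type, awaited event."""
    results = []
    current = None
    for line in log_text.splitlines():
        head = HEAD_RE.search(line)
        if head:
            # A new report starts with the "cmdq(N) ERR:" line.
            current = {"cmdq_id": int(head.group("id")),
                       "error": head.group("err"),
                       "event": None}
            results.append(current)
            continue
        event = EVENT_RE.search(line)
        if event and current is not None:
            current["event"] = event.group("event")
    return results


sample = """[CMDQ]cmdq(0) ERR: WAIT_TRIG_TIMEOUT (0x00000400)
[CMDQ]Cmd data = 0x2000 : 0x0000 : 0x0000 : 0xFFFE
[CMDQ]Cmd:WAIT, dbg:0, adr:000000, dat:0000, mask:fffe
[CMDQ]Wait command timeout. Trigger_Bus Bit [0] Event [ISP_CMDQ_DONE]
[CMDQ]cmdq(7) ERR: POLLEQ_TIMEOUT (0x00000200)
[CMDQ]Cmd:POLLEQ, dbg:0, adr:1129f0, dat:ff00, mask:00ff
[CMDQ]Poll eq command timeout.  Bank:0x1129; Offset:0x78; Value:0xff00; Mask:0x00ff
"""
reports = parse_cmdq_timeouts(sample)
for report in reports:
    print(report)
```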