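Background:
This issue comes from a real project: the production line reported a device that kept rebooting.
Root cause analysis:
The exported log shows that the crash was caused by an illegal memory access. The log is as follows: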
LR :0007a18f PC :0005c7e0
Memory management fault is caused by data access violation
The memory management fault occurred address is 1100b2e4
Thread data
r0 = 0x1000b2cc
r1 = 0x643b0010
r2 = 0x1100b2cc
r3 = 0x00000002
r4 = 0x1100a144
r5 = 0x1000db70
r6 = 0x14600b04
r7 = 0x14600aac
r8 = 0x00000001
r9 = 0xa5a5a5a5
r10 = 0xa5a5a5a5
r11 = 0xa5a5a5a5
r12 = 0x0006e6f3
lr = 0x0007a18f
pc = 0x0005c7e0
r14 = 0x00000000
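The two fault lines in the log match what a Cortex-M fault handler typically derives from the MemManage Fault Status Register (MMFSR, the low byte of CFSR) and from MMFAR. The sketch below shows that decoding under stated assumptions: the bit positions follow the ARMv7-M architecture, the helper name is hypothetical, and the log does not show the raw CFSR value, so any value passed in is illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* MMFSR bit positions (MMFSR is the low byte of CFSR on ARMv7-M cores). */
#define MMFSR_IACCVIOL  (1u << 0)  /* instruction access violation */
#define MMFSR_DACCVIOL  (1u << 1)  /* data access violation */
#define MMFSR_MMARVALID (1u << 7)  /* MMFAR holds a valid faulting address */

/* Hypothetical helper: true when the fault is a data access violation
 * and the faulting address register (MMFAR) can be trusted. */
bool is_data_access_fault_with_addr(uint32_t cfsr)
{
    uint32_t mmfsr = cfsr & 0xFFu;
    return (mmfsr & MMFSR_DACCVIOL) != 0u &&
           (mmfsr & MMFSR_MMARVALID) != 0u;
}
```

When this condition holds, the address printed by the handler (0x1100b2e4 in the log above) is the one the faulting load or store tried to access.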
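Using the PC value at the time of the crash, we located the corresponding source line, but the source alone does not reveal the specific cause of the crash: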
uint32_t
am_hal_mspi_configure(void *pHandle, const am_hal_mspi_config_t *pConfig)
{
    am_hal_mspi_state_t *pMSPIState = (am_hal_mspi_state_t *)pHandle;
    uint32_t ui32Module;
    .......
    //
    // Set XIP defaults
    //
    MSPIn(ui32Module)->DEV0SCRAMBLING = 0;
    MSPIn(ui32Module)->DEV0AXI = 0;
    g_MSPIState[ui32Module].pTCB = pConfig->pTCB; // PC=0x5c7e0, the illegal access occurs here
    g_MSPIState[ui32Module].ui32TCBSize = pConfig->ui32TCBSize;
    ......
}
Next, look at the disassembly around the crash location:
/home/jenkins/workspace/firmware_auto_trigger/build/..//platform/mcu/apollo4/./sdk/mcu/apollo4l/hal/mcu/am_hal_mspi.c:1361
5c7d0: f8c2 4080 str.w r4, [r2, #128] ; 0x80
/home/jenkins/workspace/firmware_auto_trigger/build/..//platform/mcu/apollo4/./sdk/mcu/apollo4l/hal/mcu/am_hal_mspi.c:1363
5c7d4: 4c23      ldr r4, [pc, #140] ; (5c864 <am_hal_mspi_configure+0xc4>) // the value loaded into r4 is wrong
5c7d6: 684d      ldr r5, [r1, #4]
5c7d8: f640 02c4 movw r2, #2244 ; 0x8c4
5c7dc: fb02 4203 mla r2, r2, r3, r4 // r2 = r4 + r2 * r3; r4 = 0x1100a144 is the wrong value
5c7e0: 6195      str r5, [r2, #24] // crash location: 0x1100b2cc + 24 = 0x1100b2e4 (illegal address)
/home/jenkins/workspace/firmware_auto_trigger/build/..//platform/mcu/apollo4/./sdk/mcu/apollo4l/hal/mcu/am_hal_mspi.c:1364
5c7e2: 680d      ldr r5, [r1, #0]
5c7e4: 6155      str r5, [r2, #20]
......
5c860: 01bebebe .word 0x01bebebe
5c864: 1000a144 .word 0x1000a144 // the constant that r4 reads from the code region
5c868: 1005ffff .word 0x1005ffff
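- The faulting instruction is: str r5, [r2, #24]. At the time of the crash r2 = 0x1100b2cc, so the store target is r2 + 24 = 0x1100b2e4 (an illegal address). This matches the faulting address 0x1100b2e4 reported in the log, which means the fault was caused by a bad value in r2.
- Tracing back to see why r2 is wrong, the instruction is: mla r2, r2, r3, r4. That gives r2 = r4 + r2 * r3 = 0x1100a144 + 2244 * 0x00000002 = 0x1100b2cc, which is consistent with the register dump. The r2 operand here is the immediate 2244, so the bad result must come from r4 or r3.
- Going one step further up, the instruction is: ldr r4, [pc, #140]. In Thumb state the PC value used by this instruction is its own address plus 4, word-aligned, so r4 = *(0x5c7d4 + 4 + 140) = *(0x5c864) = 0x1000a144. The actual value was r4 = 0x1100a144, which differs by exactly 1 bit (bit 24). This address lies in the code region, so a bit flip is suspected.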
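The address arithmetic in the three steps above can be reproduced with a few lines of C. This is only a sketch for checking the numbers: the function names are hypothetical, and the constants (2244, 24, 140) are taken from the disassembly shown earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Address read by a Thumb "ldr rX, [pc, #imm]": the PC seen by the
 * instruction is its own address + 4, rounded down to a multiple of 4. */
uint32_t thumb_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    assert(imm % 4u == 0u);  /* the 16-bit encoding stores imm / 4 */
    return ((insn_addr + 4u) & ~3u) + imm;
}

/* Result of "mla rd, rn, rm, ra": rd = ra + rn * rm (low 32 bits). */
uint32_t mla(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* Hypothetical helper: address written by "str r5, [r2, #24]" for a
 * given module index (r3) and base address of the state array (r4). */
uint32_t crash_store_addr(uint32_t r3_index, uint32_t r4_base)
{
    uint32_t r2 = mla(2244u, r3_index, r4_base);  /* movw r2, #2244; mla r2, r2, r3, r4 */
    return r2 + 24u;                              /* str r5, [r2, #24] */
}
```

With the corrupted base 0x1100a144 and index 2, crash_store_addr gives 0x1100b2e4, the faulting address from the log. With the intended base 0x1000a144 it gives 0x1000b2e4, and the intermediate r2 value 0x1000b2cc is the same as the r0 value in the register dump.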
In summary: the preliminary conclusion is that a single bit flipped in the code region.
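A quick way to check the "single bit" claim is to XOR the expected word with the observed one and inspect the result. The helpers below are a minimal sketch with hypothetical names, not part of the firmware or the SDK.

```c
#include <stdint.h>

/* Number of bit positions in which two 32-bit words differ. */
unsigned differing_bits(uint32_t expected, uint32_t actual)
{
    uint32_t diff = expected ^ actual;
    unsigned count = 0;
    while (diff != 0u) {
        diff &= diff - 1u;  /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Index (0 = least significant bit) of the flipped bit when exactly
 * one bit differs, otherwise -1. */
int single_flipped_bit(uint32_t expected, uint32_t actual)
{
    uint32_t diff = expected ^ actual;
    if (diff == 0u || (diff & (diff - 1u)) != 0u) {
        return -1;  /* identical words, or more than one bit differs */
    }
    int index = 0;
    while ((diff >> index) != 1u) {
        index++;
    }
    return index;
}
```

For the expected constant 0x1000a144 and the observed r4 value 0x1100a144, the XOR is 0x01000000, so exactly one bit differs, at index 24.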
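Later the device was shipped back to us, and reading the value at address 0x5c864 with GDB confirmed that the code region was indeed corrupted.
Finally, we reported the issue to the chip vendor, who confirmed that it was a chip defect with an occurrence rate of 0.01%, and that it has been resolved in later chip batches.
Finally: if anything in this document is wrong, corrections are welcome. Thanks!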