Re. Investigating random core panics on Pycom/ESP32 devices.
Finally decoding some core dumps and reading the sources gave us a clue what might be going on here.
We recognized the core panic was occurring regularly when the program was running the garbage collector while at the same time the system was under significant network and memory pressure by uploading files through FTP.
131.4430 [terkin.device ] INFO : Start curating the garbage collector
131.6197 [terkin.device ] INFO : Collecting garbage
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x400f65c6 PS : 0x00060c30 A0 : 0x800fe5c2 A1 : 0x3ffdbb30
A2 : 0x3ffc92e8 A3 : 0x3f40dc73 A4 : 0x3ffc92d0 A5 : 0x00000000
A6 : 0x3ffccf14 A7 : 0x00000000 A8 : 0x800f65c6 A9 : 0x3ffdbb10
A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3f410014 A13 : 0x00000000
A14 : 0x3f40ea20 A15 : 0x3ffcf730 SAR : 0x0000000d EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000010 LBEG : 0x4009895d LEND : 0x40098981 LCOUNT : 0x800f6361
Backtrace: 0x400f65c6:0x3ffdbb30 0x400fe5bf:0x3ffdbb50 0x400fe5d6:0x3ffdbb70 0x400f65fe:0x3ffdbb90 0x400f3550:0x3ffdbbb0 0x400f35a4:0x3ffdbbd0 0x400f4ee5:0x3ffdc010 0x400f59d8:0x3ffdc030 0x400f5ee0:0x3ffdc480 0x400e08d0:0x3ffdc4b0
The core dump gives us a pointer regarding "memory allocation failed, heap is locked"
where the stacktrace shows the fault has been triggered from within the Pycom-specific FTP-server implementation serverstask.c
and ftp.c
.
#0 nlr_jump (val=0x3ffc92e8 <mp_state_ctx+24>) at ../py/nlrsetjmp.c:33
#1 0x400fe5ae in mp_raise_msg (exc_type=0x3f40ea14 <mp_type_MemoryError>, msg=0x3f40dc67 \"memory allocation failed, heap is locked\") at ../py/runtime.c:1465
#2 0x400fe5c5 in m_malloc_fail (num_bytes=28) at ../py/runtime.c:1454
#3 0x400f65ed in m_malloc (num_bytes=28) at ../py/malloc.c:88
#4 0x400f4fa7 in f_readdir_helper (fno=0x3ffdbeb0, dp=0x3ffc9204 <ftp_data+8>) at ftp/ftp.c:341
#5 0x400f51ff in ftp_list_dir (list=0x3ffdd960 <error: Cannot access memory at address 0x3ffdd960>, listsize=0x3ffdc310, maxlistsize=512) at ftp/ftp.c:1455
#6 0x400f5ee1 in ftp_run () at ftp/ftp.c:672
#7 0x400e08d3 in TASK_Servers (pvParameters=<optimized out>) at serverstask.c:105
Reading on the MicroPython issues and sources, there is
- Unable to read from i2c in an interrupt handler? · Issue #2859 · micropython/micropython · GitHub
- pyb.Switch() Memory Error External Interrupt - MicroPython Forum
- Allow real memory errors (from locked gc) to be reported with traceback. by dhylands · Pull Request #894 · micropython/micropython · GitHub
So, this seems to be related to allocating memory from within interrupt handlers, something which Writing interrupt handlers — MicroPython 1.12 documentation recommends against.
However, slowly. After reading through the Pycom sources and comparing them with the sources from Genuine MicroPython, we see various invocations of m_malloc()
from within C-level code. m_malloc()
in turn invokes malloc()
, which will get redirected to gc_alloc()
.
While this is probably right for allocating memory for/from the MicroPython interpreter, it might not be the way to go for C-level extensions like implemented within serverstask.c
and friends. Why is that? Allocating memory from within a C-level extension should not mess with memory the MicroPython garbage collector is responsible for at all, right?
This is probably the reason this error is so difficult to reproduce: While everything will be fine in most cases, the system will panic when some C-level extension tries to allocate memory while the garbage collector is active in MicroPython land. In this case, memory allocation will just completely be rejected due to the policy of allocating memory the garbage collector should be responsible for.
The total time of activity of the garbage collector probably depends on the size of the MicroPython program running on the system with respect to memory allocation of runtime objects as well as memory fragmentation coming from the number and diversity of the Python modules loaded by the program.
If resource pressure coming from different directions of these aspects occurs at the same time and a C-level extension gets scheduled while the garbage collector is still running, the shit just hits the fan.