Don't invoke MicroPython's "m_malloc" within C-level RTOS extensions

Andreas · November 12, 2019, 11:01pm

Re. Investigating random core panics on Pycom/ESP32 devices.

Finally decoding some core dumps and reading the sources gave us a clue what might be going on here.

We recognized the core panic was occurring regularly when the program was running the garbage collector while at the same time the system was under significant network and memory pressure by uploading files through FTP.

  131.4430 [terkin.device            ] INFO   : Start curating the garbage collector
  131.6197 [terkin.device            ] INFO   : Collecting garbage
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x400f65c6  PS      : 0x00060c30  A0      : 0x800fe5c2  A1      : 0x3ffdbb30
A2      : 0x3ffc92e8  A3      : 0x3f40dc73  A4      : 0x3ffc92d0  A5      : 0x00000000
A6      : 0x3ffccf14  A7      : 0x00000000  A8      : 0x800f65c6  A9      : 0x3ffdbb10
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x3f410014  A13     : 0x00000000
A14     : 0x3f40ea20  A15     : 0x3ffcf730  SAR     : 0x0000000d  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000010  LBEG    : 0x4009895d  LEND    : 0x40098981  LCOUNT  : 0x800f6361

Backtrace: 0x400f65c6:0x3ffdbb30 0x400fe5bf:0x3ffdbb50 0x400fe5d6:0x3ffdbb70 0x400f65fe:0x3ffdbb90 0x400f3550:0x3ffdbbb0 0x400f35a4:0x3ffdbbd0 0x400f4ee5:0x3ffdc010 0x400f59d8:0x3ffdc030 0x400f5ee0:0x3ffdc480 0x400e08d0:0x3ffdc4b0

The core dump gives us a pointer regarding "memory allocation failed, heap is locked" where the stacktrace shows the fault has been triggered from within the Pycom-specific FTP-server implementation serverstask.c and ftp.c.

#0  nlr_jump (val=0x3ffc92e8 <mp_state_ctx+24>) at ../py/nlrsetjmp.c:33
#1  0x400fe5ae in mp_raise_msg (exc_type=0x3f40ea14 <mp_type_MemoryError>, msg=0x3f40dc67 \"memory allocation failed, heap is locked\") at ../py/runtime.c:1465
#2  0x400fe5c5 in m_malloc_fail (num_bytes=28) at ../py/runtime.c:1454
#3  0x400f65ed in m_malloc (num_bytes=28) at ../py/malloc.c:88
#4  0x400f4fa7 in f_readdir_helper (fno=0x3ffdbeb0, dp=0x3ffc9204 <ftp_data+8>) at ftp/ftp.c:341
#5  0x400f51ff in ftp_list_dir (list=0x3ffdd960 <error: Cannot access memory at address 0x3ffdd960>, listsize=0x3ffdc310, maxlistsize=512) at ftp/ftp.c:1455
#6  0x400f5ee1 in ftp_run () at ftp/ftp.c:672
#7  0x400e08d3 in TASK_Servers (pvParameters=<optimized out>) at serverstask.c:105

Reading on the MicroPython issues and sources, there is

So, this seems to be related to allocating memory from within interrupt handlers, something which Writing interrupt handlers — MicroPython 1.12 documentation recommends against.

However, slowly. After reading through the Pycom sources and comparing them with the sources from Genuine MicroPython, we see various invocations of m_malloc() from within C-level code. m_malloc() in turn invokes malloc(), which will get redirected to gc_alloc().

github.com

micropython/micropython/blob/v1.11/py/malloc.c#L85-L97


void *m_malloc(size_t num_bytes) {
    void *ptr = malloc(num_bytes);
    if (ptr == NULL && num_bytes != 0) {
        m_malloc_fail(num_bytes);
    }
#if MICROPY_MEM_STATS
    MP_STATE_MEM(total_bytes_allocated) += num_bytes;
    MP_STATE_MEM(current_bytes_allocated) += num_bytes;
    UPDATE_PEAK();
#endif
    DEBUG_printf("malloc %d : %p\n", num_bytes, ptr);
    return ptr;
}

github.com

micropython/micropython/blob/v1.11/py/malloc.c#L51-L63


// We redirect standard alloc functions to GC heap - just for the rest of
// this module. In the rest of MicroPython source, system malloc can be
// freely accessed - for interfacing with system and 3rd-party libs for
// example. On the other hand, some (e.g. bare-metal) ports may use GC
// heap as system heap, so, to avoid warnings, we do undef's first.
#undef malloc
#undef free
#undef realloc
#define malloc(b) gc_alloc((b), false)
#define malloc_with_finaliser(b) gc_alloc((b), true)
#define free gc_free
#define realloc(ptr, n) gc_realloc(ptr, n, true)
#define realloc_ext(ptr, n, mv) gc_realloc(ptr, n, mv)

While this is probably right for allocating memory for/from the MicroPython interpreter, it might not be the way to go for C-level extensions like implemented within serverstask.c and friends. Why is that? Allocating memory from within a C-level extension should not mess with memory the MicroPython garbage collector is responsible for at all, right?

This is probably the reason this error is so difficult to reproduce: While everything will be fine in most cases, the system will panic when some C-level extension tries to allocate memory while the garbage collector is active in MicroPython land. In this case, memory allocation will just completely be rejected due to the policy of allocating memory the garbage collector should be responsible for.

The total time of activity of the garbage collector probably depends on the size of the MicroPython program running on the system with respect to memory allocation of runtime objects as well as memory fragmentation coming from the number and diversity of the Python modules loaded by the program.

If resource pressure coming from different directions of these aspects occurs at the same time and a C-level extension gets scheduled while the garbage collector is still running, the shit just hits the fan.

Andreas · November 12, 2019, 11:31pm

After replacing m_malloc() through pvPortMalloc() and m_free() through vPortFree() within Pycom’s ftp.c, we see m_malloc() is even used for allocating memory within the LittleFS implementation, vfs_littlefs.c. Gosh!

#0  nlr_jump (val=0x3ffc92e8 <mp_state_ctx+24>) at ../py/nlrsetjmp.c:33
#1  0x400fe5c2 in mp_raise_msg (exc_type=0x3f40ea20 <mp_type_MemoryError>, msg=0x3f40dc73 \"memory allocation failed, heap is locked\") at ../py/runtime.c:1465
#2  0x400fe5d9 in m_malloc_fail (num_bytes=12) at ../py/runtime.c:1454
#3  0x400f6601 in m_malloc (num_bytes=12) at ../py/malloc.c:88
#4  0x400f3553 in littlefs_prepare_attributes (cfg=0x3ffc9258 <ftp_data+92>) at littlefs/vfs_littlefs.c:357
#5  0x400f35a7 in littlefs_open_common_helper (lfs=0x3ffc46a8 <sflash_vfs_flash+56>, file_path=0x3ffddb82 <error: Cannot access memory at address 0x3ffddb82>, fp=0x3ffc9204 <ftp_data+8>, mode=1282, cfg=0x3ffc9258 <ftp_data+92>, timestamp_update_ptr=0x3ffc9264 <ftp_data+104>) at littlefs/vfs_littlefs.c:283
#6  0x400f4ee8 in f_open_helper (fp=0x3ffc9204 <ftp_data+8>, mode=10 '\
', path=<optimized out>) at ftp/ftp.c:242
#7  ftp_open_file (path=<optimized out>, mode=10) at ftp/ftp.c:1392
#8  0x400f59db in ftp_process_cmd () at ftp/ftp.c:1186
#9  0x400f5ee3 in ftp_run () at ftp/ftp.c:659
#10 0x400e08d3 in TASK_Servers (pvParameters=<optimized out>) at serverstask.c:105

Andreas · November 13, 2019, 1:05am

So, we continued scanning through the code replacing all calls to these functions as well as some other direct calls to gc_malloc() elsewhere through their regular variants not involving the MicroPython’s garbage collector in the hope we wouldn’t break things.

Now, we are running FiPy-1.20.1.r1-0.6.0-vanilla-dragonfly.tar.gz and things feel pretty damn stable.

Andreas · February 26, 2020, 6:01pm

To be continued

This story is not over. People are still observing random core panics, specifically when using alarm timers and such.

So, we tried to investigate again.

Observations

m_malloc is used for allocating memory within the mDNS implementation added the other day.

ag --case-sensitive m_malloc esp32/

esp32/mods/modmdns.c
171:            service_txt = m_malloc(length_total * sizeof(mdns_txt_item_t));

gc_alloc is used for acquiring memory for mutexes, alarm timers and the pwm implementation.

$ ag --case-sensitive gc_alloc esp32/

esp32/mpthreadport.c
283:    self->mutex = gc_alloc(sizeof(mp_thread_mutex_t), false);

esp32/mods/machtimer_alarm.c
59:    MP_STATE_PORT(mp_alarm_heap) = gc_alloc(ALARM_HEAP_MAX_ELEMENTS * sizeof(mp_obj_alarm_t *), false);

esp32/mods/machpwm.c
106:        pwm->mach_pwm_channel_obj_t[pwm_channel_id] = gc_alloc(sizeof(mach_pwm_channel_obj_t), false);
194:        MP_STATE_PORT(mach_pwm_timer_obj[pwm_timer_id]) = gc_alloc(sizeof(mach_pwm_timer_obj_t), false);

Summary

While 1. is probably only relevant when actually using mDNS, 2. is really really sad as it will come into play when just using threads, alarm timers, or the PWM implementation.