Investigating core panics on the LoPy4

pascalschaefer · January 3, 2020, 1:09pm

Thanks a lot for your message and for all the time you have invested in helping to solve this problem.
I am happy to help in any way I can to improve the long-term stability as well.

Do you see any chance that I can support you with these two approaches?

Debugging the device? Is there any proper way to do this on my own?
Recompiling the Pycom firmware and ESP32 code with

CONFIG_ESP32_PANIC_SILENT_REBOOT

instead of

CONFIG_ESP32_PANIC_PRINT_REBOOT

to check whether the device restarts when no message is printed? I know that this is not a solution, because the actual cause needs to be fixed. For us, it would just be less dramatic if the device restarted properly when a core panic happens.

Many thanks and best regards, Pascal

pascalschaefer · January 7, 2020, 8:24am

Hi @Andreas

Short question.
Do you know whether these long-term stability issues and core panics are recognized by Pycom and the ESP32 community?

It seems to be a big topic for many people.
To me it is unclear whether and when we can expect any improvements.

Do you know something about that?

Thanks a lot in advance and Best Regards, Pascal

Thias · January 7, 2020, 11:18am

In my setup, using the current Terkin version from the master branch in LoRaWAN mode and sending payloads every 5 minutes, I’ve not seen any crashes since I put the LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire firmware onto the device a couple of weeks ago. Terkin nowadays also uses the nvram_save/nvram_restore capabilities, and everything works as expected and is super stable.

Please try giving the device some rest after the s.send call before invoking s.recv, e.g. 5 seconds. Also, setblocking is always set to False in our Terkin routines. I remember people running into trouble with a blocking socket.
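
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not code taken from Terkin: the helper name send_then_receive and its parameters are assumptions made for this example, and it assumes the Pycom behaviour that recv on a non-blocking LoRa socket returns immediately (possibly with empty bytes) when no downlink is pending.

```python
import time


def send_then_receive(sock, payload, rest_seconds=5, max_bytes=64, sleep=time.sleep):
    """Send an uplink, rest for a while, then poll once for a downlink.

    `sock` is any object with setblocking/send/recv, e.g. a LoRa socket.
    The function name and parameters are hypothetical, for illustration only.
    """
    # Non-blocking mode: recv() should not wait for data to arrive.
    sock.setblocking(False)
    sock.send(payload)
    # Give the device some rest before asking for downlink data.
    sleep(rest_seconds)
    return sock.recv(max_bytes)
```

On a LoPy4, sock would typically be the socket created with socket.socket(socket.AF_LORA, socket.SOCK_RAW) after joining the network, and the call would be something like send_then_receive(s, b"\x01\x02").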

pascalschaefer · January 7, 2020, 12:29pm

Dear @Thias

Thank you so much for your suggestion. I have tried it out. Unfortunately the device is still crashing.

I now think it is more related to a second task run by Timer.Alarm(), which results in a stack size or memory size problem inside the ESP32. I am now running an automated test without synchronizing the device time at intervals with Timer.Alarm(). If this works, we will restart the device once a day and synchronize the time at boot, and I think we will be more or less fine.

Thanks for all the help you guys.

Best Regards, Pascal

Andreas · January 7, 2020, 9:34pm

Dear Pascal,

If you are able to share some code we could use to reproduce this error, we might be able to look into this. As we’ve invested a considerable amount of time into getting a stable firmware, I would be interested to investigate this further, if time permits.

Oh, you are not alone. Others might be observing the same thing.

With kind regards,
Andreas.

Andreas · January 9, 2020, 11:37pm

It looks like they succeeded after switching to Dragonfly.

So, we are looking at the next observations of core dumps on LoPy4 devices.

https://forum.pycom.io/topic/5423/core-dump-problem

Andreas · January 9, 2020, 11:44pm

Dear @pascalschaefer,

@sita on the Pycom user forum (see above: https://forum.pycom.io/topic/5560/guru-meditation-error/13) was able to get stable runtime behavior using the Dragonfly firmware on two LoPy4 devices. However, it took him/her two attempts to do so. So, I am humbly asking if you might want to try again.

Would you be able to share more of the corresponding MicroPython code in order to reproduce the crash or to just inspect where the error might be originating from?

Still no luck with core dumps?

With kind regards,
Andreas.

pascalschaefer · January 13, 2020, 9:17am

Dear @Andreas

Thanks for your message.
My company doesn’t allow me to publish the full source.
I am currently building a minimal version to reproduce the issue.

Best Regards, Pascal.

pascalschaefer · January 13, 2020, 2:16pm

Hi @Andreas

I have prepared the source to reproduce the core panic.
Because I can’t upload the zip file, I will send it to your email address.

When you start the device, it mostly happens after 2-3 minutes.
Sometimes it does not, so you may need 2-3 tries.

In main.py, you need to update the options JSON properties to insert your lora_app_eui / lora_app_key. The mode is otaa but can be changed to abp.

After 10 seconds, 10 events are added, and 25 seconds later the time synchronization starts.
After some time the core panic happens.

The package also contains an example log of the core panic.

Please let me know if you have questions or whether it also works on your side.
Used setup:

(sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.3.0-vanilla-psramfix-unicore', version='69dd8b5d-dirty on 2019-11-05', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')

Thanks for all your support
Best Regards, Pascal

Thias · January 13, 2020, 2:22pm

Have you not yet tried the Dragonfly variant, which is supposed to fix the core panics?

pascalschaefer · January 13, 2020, 2:24pm

Hi @Thias

Oops, it seems I have installed the wrong version. I will retest right away.

pascalschaefer · January 13, 2020, 2:37pm

Hi @Thias

I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (wait 2-3 minutes; if there is no core panic, restart the device),
the core panic happened again.

Are you also interested in the source?

pascalschaefer · January 13, 2020, 2:43pm

Hi @Thias @Andreas

I have published the source on GitHub:

Thias · January 13, 2020, 3:17pm

No. I leave it to Andreas

clemens · January 13, 2020, 5:14pm

This looks interesting

Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.

I found this discussion; perhaps it helps: https://github.com/espressif/arduino-esp32/issues/855

Andreas · January 14, 2020, 2:23am

Dear Pascal,

Thanks for sharing pascalschaefer/investigating-core-panics on GitHub. However, it is not exactly a small code base.

May I ask whether you are using FatFS or LittleFS already?

With kind regards,
Andreas.

pascalschaefer · January 14, 2020, 11:01am

Hi @Andreas

Yes, we are already using LittleFS.

Hi @clemens

Thanks a lot for the link. Once the Timer.Alarm gets called, a lot of code is indeed executed.
I don’t know enough about what is going on under the hood in detail, but the approaches in the link make sense. So in our case it may be more of an interrupt-related problem. I will try it out: I will set a flag in the alarm, execute the time synchronization in the main loop based on that flag, and check whether it helps.
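
The flag approach described above could look roughly like this. It is only a sketch of the general pattern, not the code from the repository: the class name DeferredTimeSync and the sync_func callable are hypothetical, and it is not a confirmed fix for the core panic.

```python
class DeferredTimeSync:
    """Run time synchronization from the main loop, not from the alarm callback.

    `sync_func` is a hypothetical callable that performs the actual work.
    """

    def __init__(self, sync_func):
        self._sync_func = sync_func
        self._pending = False

    def on_alarm(self, alarm=None):
        # Callback context: only record that work is due, nothing heavy here.
        self._pending = True

    def poll(self):
        # Main-loop context: run the deferred work at most once per request.
        if not self._pending:
            return False
        self._pending = False
        self._sync_func()
        return True
```

On the device, on_alarm would be registered as the alarm handler, for example Timer.Alarm(sync.on_alarm, 3600, periodic=True) (the interval is an assumed value), and the main loop would call sync.poll() on each iteration.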

Thanks for all your help.

Best Regards, Pascal

pascalschaefer · January 14, 2020, 12:00pm

Hi @Andreas @clemens

For your information, I have modified the code to only set a flag in the Timer.Alarm part and moved the logic to the main thread. Because the core dump is still thrown, we can exclude a problem related to interrupts. The next thing I can imagine is the main thread and the worker thread for sending events running in parallel. Maybe it is a stack size problem. I have checked the memory, and it seems to be fine.

_thread.stack_size(64536)
self._publisherThread = _thread.start_new_thread(self.sendPendingEvents, ())
_thread.stack_size(0)

Best Regards, Pascal

Andreas · January 14, 2020, 6:37pm

Just a note: see also "Firmware Release v1.20.1" on the Pycom user forum ff. regarding LoRa stack intrinsics, the respective OnRadioRx/OnRadioTx interrupts, and timer interrupts.

Andreas · January 15, 2020, 3:21pm

Probably also related.