Investigating core panics on the LoPy4

Dear Pascal,

sorry, there’s no full core dump. Have you been able to catch one?

With kind regards,
Andreas.

Hi @Andreas

Thanks for your response.
Unfortunately there is no core dump printed.
All crashes just result in the above messages.
Is there a possibility to extend the log to print the desired dump?

Many thanks and Kind Regards
Pascal

Hmm, sorry for that!

There is a compile-time flag to turn off printing core dumps. However, with the dragonfly builds, it should be turned on.

I don’t have a clue why no core dump is printed when your device observes a core panic.

Hi @Andreas

Thanks a lot for your message and all the effort you are putting into helping to solve this problem.
I am happy to help in any way to improve the long-term stability as well.

Do you see any chance that I can support you with one of these two approaches?

  • Debugging the device. Is there any sound way to do this on my own?
  • Recompiling the Pycom firmware and ESP32 code with CONFIG_ESP32_PANIC_SILENT_REBOOT instead of CONFIG_ESP32_PANIC_PRINT_REBOOT, to check whether the device restarts when no message is printed.

I know that this is not a solution, because the actual cause needs to be fixed. For us, it would just be less dramatic if the device restarted properly when a core panic happens.

Many thanks and best regards, Pascal

Hi @Andreas

Short question.
Do you know whether these long-term stability issues and core panics are recognized by Pycom and the ESP32 community?

It seems to be a big topic for many people.
To me it is unclear whether and when we can expect any improvements.

Do you know something about that?

Thanks a lot in advance and Best Regards, Pascal

In my setup, using the current Terkin version from the master branch in LoRaWAN mode and sending payloads every 5 minutes, I’ve not seen any crashes since I put the LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire firmware onto the device a couple of weeks ago. Terkin nowadays also uses the nvram_save/nvram_restore capabilities, and everything works as expected and is super stable.

Please try giving the device some rest after the s.send call before invoking s.recv, e.g. 5 seconds. Also, setblocking is always set to False in our Terkin routines. I remember people running into trouble with a blocking socket.
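
The advice above could be sketched roughly as follows. This is only an illustration, not Terkin's actual code: the helper name send_then_receive, its parameters and the buffer size are assumptions made for this example, and sock stands for any object offering setblocking, send and recv, such as the LoRa socket already opened by the program.

```python
import time


def send_then_receive(sock, payload, rest_seconds=5, bufsize=64, sleep=time.sleep):
    """Send an uplink, let the device rest, then poll for a downlink.

    Hypothetical helper: the name and parameters are illustrative only.
    `sock` is any socket-like object with setblocking/send/recv.
    """
    # Never block on the socket, as suggested in the post above.
    sock.setblocking(False)
    sock.send(payload)

    # Give the device some rest before asking for received data.
    sleep(rest_seconds)

    try:
        data = sock.recv(bufsize)
    except OSError:
        # Some non-blocking sockets raise instead of returning empty data.
        data = b""
    return data
```

On the device, the call would then look like send_then_receive(s, payload), with s being the socket the program already uses for s.send and s.recv.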

Dear @Thias

Thank you so much for your suggestion. I have tried it out. Unfortunately the device is still crashing.

I now think it is more related to a second task run by Timer.Alarm(), which results in a stack size or memory size problem inside the ESP32. I am now running an automated test without synchronizing the device time at intervals with Timer.Alarm(). If this works, we will restart the device once a day and synchronize the time at boot, and I think we will be kind of fine.

Thanks for all the help, guys.

Best Regards, Pascal

Dear Pascal,

If you are able to share some code we could use to reproduce this error, we might be able to look into this. As we’ve invested a considerable amount of time into getting a stable firmware, I would be interested to investigate this further, if time permits.

Oh, you are not alone. Others might be observing the same thing.

With kind regards,
Andreas.

It looks like they succeeded after switching to Dragonfly.

So, we are looking at the next observations of core dumps on LoPy4 devices.

https://forum.pycom.io/topic/5423/core-dump-problem

Dear @pascalschaefer,

@sita on the Pycom user forum (see above: https://forum.pycom.io/topic/5560/guru-meditation-error/13) was able to get stable runtime behavior using the Dragonfly firmware on two LoPy4 devices. However, it took him/her two attempts to do so. So, I am humbly asking if you might want to try again.

Would you be able to share more of the corresponding MicroPython code in order to reproduce the crash or to just inspect where the error might be originating from?

Still no luck with core dumps?

With kind regards,
Andreas.

Dear @Andreas

Thanks for your message.
My company doesn’t allow me to publish the full source.
I am currently building a minimal version to reproduce the issue.

Best Regards, Pascal.

Hi @Andreas

I have prepared the source to reproduce the core panic.
Because I can’t upload the zip file, I will send it to your email address.

When you start the device, it mostly happens after 2-3 minutes.
Sometimes it does not; you may need 2-3 tries.

In main.py, you need to update the options JSON properties to insert your lora_app_eui / lora_app_key. The mode is otaa, but it can be changed to abp.

After 10 seconds, 10 events are added, and 25 seconds later the time synchronization starts.
After some time the core panic happens.

The package also contains an example log of the core panic.

Please let me know if you have questions or whether it also works on your side.
Used setup:

(sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.3.0-vanilla-psramfix-unicore', version='69dd8b5d-dirty on 2019-11-05', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')

Thanks for all your support
Best Regards, Pascal

You have not yet tried the dragonfly variant, which is supposed to fix the core panics?

Hi @Thias

Oops, it seems I have installed the wrong version. I will retest right away.
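
A quick way to catch such a mix-up is to inspect the release string the firmware reports. The following is a minimal sketch, assuming that dragonfly builds always carry the word "dragonfly" in their release string, as the ones quoted in this thread do; the function name is_dragonfly_build is made up for this example.

```python
def is_dragonfly_build(release):
    """Return True if a firmware release string names a dragonfly build.

    Hypothetical helper: it only checks for the word "dragonfly".
    """
    return "dragonfly" in release.lower()


# Release strings as quoted in this thread.
print(is_dragonfly_build("1.20.1.r1-0.3.0-vanilla-psramfix-unicore"))       # False
print(is_dragonfly_build("1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s"))  # True
```

On the device, the release string would presumably come from the same uname information that is printed in the setup lines of this thread.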

Hi @Thias

I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (waiting 2-3 minutes; if there is no core panic, restart the device),
the core panic happens again.

Are you also interested in the source?

Hi @Thias @Andreas

I have published the source on GitHub:

No. I leave it to Andreas

This looks interesting

Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.

I found this discussion https://github.com/espressif/arduino-esp32/issues/855 , perhaps it helps.

Dear Pascal,

thanks for sharing pascalschaefer/investigating-core-panics on GitHub. However, it is not exactly a small code base.

May I ask whether you are using FatFS or LittleFS already?

With kind regards,
Andreas.

Hi @Andreas

Yes, we are already using LittleFS.

Hi @clemens

Thanks a lot for the link. Once the Timer.Alarm callback gets called, a lot of code is indeed executed.
I don’t know enough about what is going on under the hood in detail, but the approaches in the link make sense. So in our case it may be more of an interrupt-related problem. I will try it out: I will set a flag and execute the time synchronization in the main loop based on that flag, and check whether it helps.
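
The flag approach described above could look roughly like this. It is a minimal sketch under assumptions, not the code from the repository: the class DeferredTimeSync and its method names are invented for this example, and sync_func stands for whatever routine performs the actual time synchronization.

```python
class DeferredTimeSync:
    """Run time synchronization from the main loop, not from the alarm callback.

    Hypothetical sketch: class and method names are illustrative only.
    """

    def __init__(self, sync_func):
        self._sync_func = sync_func
        self._pending = False

    def on_alarm(self, alarm=None):
        # Called from the timer context: do as little as possible here.
        self._pending = True

    def poll(self):
        # Called from the main loop: do the heavy work in normal context.
        if not self._pending:
            return False
        self._pending = False
        self._sync_func()
        return True
```

On the device, on_alarm would be the handler passed to Timer.Alarm(), and poll() would be called once per pass of the main loop, so the alarm itself no longer executes the synchronization code.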

Thanks for all your help.

Best Regards, Pascal