Investigating core panics on the LoPy4

Hi @Thias

Oops, it seems I installed the wrong version. I will retest right away.

Hi @Thias

I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (wait 2-3 minutes; if no core panic occurs, restart the device), the core panic happens again.

Are you also interested in the source?

Hi @Thias @Andreas

I have published the source on GitHub:

No. I leave it to Andreas

This looks interesting

Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core  1 panic'ed (IllegalInstruction). Exception was unhandled.

I found this discussion https://github.com/espressif/arduino-esp32/issues/855 , perhaps it helps.

Dear Pascal,

thanks for sharing GitHub - pascalschaefer/investigating-core-panics. However, it is not quite a small code base.

May I ask whether you are using FatFS or LittleFS already?

With kind regards,
Andreas.


Hi @Andreas

Yes, we are using LittleFS already.

Hi @clemens

Thanks a lot for the link. Once the Timer.Alarm gets called, a lot of code is indeed executed.
I don’t know enough about what is going on under the hood in detail, but the approaches in the link make sense. So in our case it may be more of an interrupt-related problem. I will try it out: I will only set a flag in the alarm handler, execute the Time Synchronization in the main loop when the flag is set, and check whether it helps.

Thanks for all your help.

Best Regards, Pascal

Hi @Andreas @clemens

For your information, I have modified the code to only set a flag in the Timer.Alarm part and moved the logic to the main thread. Because the core dump is still thrown, we can exclude a problem related to interrupts. The next candidate I can imagine is the main thread and the worker thread for sending events running in parallel. Maybe it is a stack size problem. I have checked the memory, and it seems to be fine.

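import _thread

# Request a larger stack for the next thread that gets created
_thread.stack_size(64536)
self._publisherThread = _thread.start_new_thread(self.sendPendingEvents, ())
# Reset the stack size to the platform default for subsequent threads
_thread.stack_size(0)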

Best Regards, Pascal
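
The flag approach described in this post can be sketched roughly as follows. This is only a minimal illustration and not the code from the published repository: the class `TimeSync` and its methods are hypothetical names, and the registration with Pycom's `Timer.Alarm` is shown in a comment only, so the sketch also runs on a desktop Python.

```python
class TimeSync:
    """Hypothetical stand-in for the time synchronization logic."""

    def __init__(self):
        self.sync_requested = False
        self.sync_count = 0

    def on_alarm(self, alarm=None):
        # Runs in the timer callback: only set a flag, do no real work here.
        self.sync_requested = True

    def poll(self):
        # Runs in the main loop: perform the deferred work if the flag is set.
        if self.sync_requested:
            self.sync_requested = False
            self.synchronize()
            return True
        return False

    def synchronize(self):
        # Placeholder for the actual synchronization (assumption).
        self.sync_count += 1


sync = TimeSync()

# On Pycom firmware the callback would be registered along these lines
# (assumed interval of 60 seconds):
#   from machine import Timer
#   Timer.Alarm(sync.on_alarm, 60, periodic=True)
#
# The main loop then calls sync.poll() on every iteration.
```

The point of the design is that the callback stays as short as possible, while everything that allocates memory or touches peripherals runs in the main loop.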

Just a note: See also Firmware Release v1.20.1 | Pycom user forum ff. re. LoRa-Stack intrinsics, respective OnRadioRx/OnRadioTx interrupts and timer interrupts.

Probably also related.

Dear Pascal,

I am only now seeing that you are actually using pure LoRa instead of LoRaWAN.

While investigating another Guru Meditation Error | Pycom user forum, I just found this piece within the Listen-before-Talk (LBT) implementation.

Maybe LoPy4-1.20.1.r3-0.8.0-vanilla-squirrel-unicore.tar.gz works better within this scenario.

With kind regards,
Andreas.


Edit:

Of course, this is nonsense. While the LoRa interface is initialized that way first, it will get reinitialized later to use LoRaWAN:

Dear Pascal,

the Pycom engineers recently published Pycom Firmware Release 1.20.2 (thanks!). I have spotted two updates specifically related to LoRa.

If we are lucky, this fixes the "bad00bad bad00bad bad00bad" errors you have been observing. So, I am humbly asking you to try LoPy4-1.20.2.rc3-0.8.0-vanilla-squirrel.tar.gz in order to find out. I will be happy to receive corresponding core dumps.

Thanks already and with kind regards,
Andreas.


Dear @Andreas

Thanks for all your help. Today is a good day!

I tried the new build LoPy4-1.20.2.rc3-0.8.0-vanilla-squirrel.tar.gz and no "bad00bad bad00bad bad00bad" core panic has happened so far.

All tests that previously produced a core panic in every case now run successfully without any crash. The full firmware doesn’t crash anymore either and has been running stably for 2 hours with intensive use of adding Events and Time Synchronization. I will let the device run to get some long-term observations.

Many thanks to you and all people involved!

Best Regards, Pascal


Dear @Andreas

Thank you for the work being done to track down the cause for the random core panics.

We are currently running a LoPy4 on Pycom’s v1.20.0.rc13 firmware. We are experiencing random core panics after a LoRa packet is sent. Usually, the packet is successfully sent, but immediately afterwards the core panic occurs. We have been able to catch a couple of core dumps, which unfortunately I cannot attach to this post because I am a new user. In what way could I send these to you?

Here is the version currently running on our LoPy4: (sysname='LoPy4', nodename='LoPy4', release='1.20.0.rc13', version='v1.9.4-94bb382 on 2019-08-22', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')

Here is the first section printed out after the core panic:

Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
Core 1 register dump:
PC      : 0x40114bb0  PS      : 0x00060734  A0      : 0x80085755  A1      : 0x3ffc1360  
A2      : 0x00000002  A3      : 0x3ffcb728  A4      : 0x00000000  A5      : 0x00dbaa05  
A6      : 0x3ffcc784  A7      : 0x00000001  A8      : 0x80084a0e  A9      : 0x3ffc1340  
A10     : 0x00dbaa05  A11     : 0x3ffc1361  A12     : 0x3ffc1361  A13     : 0x3ffcc7e0  
A14     : 0x3ffcc7d0  A15     : 0x3ffae270  SAR     : 0x0000000e  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x4009c146  LEND    : 0x4009c155  LCOUNT  : 0x00000000  
Core 1 was running in ISR context:
EPC1    : 0x40084507  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40114bb0

Do you know if Pycom’s new firmware, v1.20.2.rc3, might help to mitigate these issues?

Dear Dan,

Pycom just released 1.20.2.rc6. We’ve built upon that and updated our series of Squirrel firmware for Pycom/ESP32. So, please try the latest and greatest LoPy4-1.20.2.rc6-0.10.0-vanilla-squirrel.tar.gz while reading the installation instructions carefully.

You will be able to submit a core dump by putting it into a .txt file and attaching it to a post by just drag & drop. If this doesn’t work, please use the private messaging feature. Please also make sure you are submitting the exact version of the firmware you have been using to capture the core dump.

With kind regards,
Andreas.


@Andreas thank you. Are there specific advantages to using LoPy4-1.20.2.rc6-0.10.0-vanilla-squirrel.tar.gz over Pycom’s own 1.20.2.rc6? Or even 1.20.0.rc3? I am asking because, for production purposes, it would be easier for us to use Pycom’s firmware, as we must flash it onto dozens of devices.

I will however try both, and come back with results when I have completed the tests. The core panics are somewhat infrequent in our case, sometimes happening only after at least 12 hours of operation. And as I said, they always appear to occur after a LoRa packet is sent.

Best wishes,

Dan

Yes, there are. I’ve outlined some details about the Squirrel builds already, and I am currently investigating further, as others are still reporting occasional core panics.

Thanks!

I hear you, others are reporting similar things. It’s sad but true!


Dear @Andreas

We have been testing Pycom’s own v1.20.2.rc6 (not your Squirrel builds yet) on our LoPy4, and I am pleased to inform you that in the 2 days our test unit has been running, we have not encountered any random core panics/resets. In the past, two days of continuous operation would cause at least 1 random core panic, if not a couple.

One important change we had to make to the code was to add a 10 ms delay within our main loop, because after moving from v1.20.0.rc13 to v1.20.2.rc6, threads did not execute unless the main thread yielded for at least a short time (see [1]).

I don’t know, then, if the cause for the random core panics was us not having this short delay in our code, or the new firmware version, or a combination of both. Regardless, we are content.

One last question: Would you be able to shed some light on the problems/bugs Pycom’s implementation of LittleFS appears to have? We are thinking of moving to this filesystem (we are using FatFS in all our units) because we have noticed that data in NVRAM and the SD card is getting corrupted (and possibly causing the LoPy4 to hang infinitely), but we are skeptical of making the change because you had mentioned bugs in Pycom’s implementation of LittleFS. Any light you could shed on this?

Thanks in advance!

Best wishes,

Dan

[1] Main thread blocks auxiliary threads (_thread module) | Pycom user forum
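
The short delay in the main loop mentioned in this post can be illustrated with a small sketch. It is not Dan's actual code: `worker`, `main_loop`, `run_demo` and the `state` dictionary are hypothetical names, and `time.sleep(0.01)` is used so that the example also runs on a desktop Python (on MicroPython, `time.sleep_ms(10)` would be the equivalent call).

```python
import _thread
import time

# Shared state between the main thread and the auxiliary thread.
state = {"running": True, "ticks": 0}


def worker():
    # Auxiliary thread: counts how often it got a chance to run.
    while state["running"]:
        state["ticks"] += 1
        time.sleep(0.01)


def main_loop(iterations):
    for _ in range(iterations):
        # ... application work would go here ...
        # Yield for about 10 ms so that other threads get scheduled.
        time.sleep(0.01)


def run_demo(iterations=20):
    state["running"] = True
    state["ticks"] = 0
    _thread.start_new_thread(worker, ())
    main_loop(iterations)
    state["running"] = False
    time.sleep(0.05)  # give the worker time to notice the stop flag
    return state["ticks"]
```

Calling `run_demo()` returns the number of times the worker ran while the main loop was active. Without the sleep in `main_loop`, the firmware behaviour described above would leave the worker starved.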


Hi @d.alvrzx,

Thanks for letting us know!

Thanks again for sharing your insights!

Sure. We have listed all changes on Squirrel firmware for Pycom/ESP32. The mitigations specific to LittleFS can be found at Fix core panics: Don't use m_malloc and gc_malloc unintended by amotl · Pull Request #418 · pycom/pycom-micropython-sigfox · GitHub.

With kind regards,
Andreas.


Dear @Andreas

As always, I’d like to thank you for your continued support. We are now in the middle of a 1 week stability test to see if we can run throughout this time without any resets. Hoping for the best.

In any case, we’ll keep in touch. Thanks for everything.

Best regards,

Dan