Thanks a lot for your message and for all the effort you have put into helping to solve this problem.
I am happy to help in any way to improve the long-term stability as well.
Do you see any way I could support you with these two approaches?
Debugging the device: is there any sound way to do this on my own?
Recompiling the Pycom firmware and ESP32 code with
CONFIG_ESP32_PANIC_SILENT_REBOOT
instead of
CONFIG_ESP32_PANIC_PRINT_REBOOT
to check whether the device restarts when no message is printed? I know that this is not a solution, because the actual cause needs to be fixed. For us, it would just be less dramatic if the device restarted properly when a core panic happens.
In my setup, using the current Terkin version from the master branch in LoRaWAN mode and sending payloads every 5 minutes, I’ve not seen any crashes since I put the LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire firmware onto the device a couple of weeks ago. Terkin nowadays also uses the nvram_save/nvram_restore capabilities, and everything works as expected and is super stable.
Please try giving the device some rest after the s.send call before invoking s.recv, e.g. 5 seconds. Also, setblocking is always set to False in our Terkin routines. I remember people running into trouble with a blocking socket.
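To make the suggestion concrete, here is a minimal sketch of the send, rest, receive sequence. It is not the actual Terkin code: `send_then_receive` and its parameters are names made up for illustration, and `sock` is assumed to be an already opened LoRa socket (any object with `setblocking`, `send` and `recv` will do).

```python
import time


def send_then_receive(sock, payload, rest_seconds=5, bufsize=64, sleep=time.sleep):
    """Send an uplink, let the device rest, then poll once for a downlink.

    Hypothetical helper for illustration only; the 5 second rest and the
    64 byte buffer are example values, not tuned settings.
    """
    # Never block inside send/recv.
    sock.setblocking(False)
    sock.send(payload)

    # Give the radio and the network stack some rest before reading.
    sleep(rest_seconds)

    try:
        return sock.recv(bufsize)
    except OSError:
        # Some non-blocking sockets raise instead of returning empty bytes
        # when no data is waiting.
        return b""
```

In the main loop this would be called as `data = send_then_receive(s, payload)`, where an empty result simply means that no downlink arrived.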
Thank you so much for your suggestion. I have tried it out. Unfortunately, the device is still crashing.
I now think it is more likely related to the second task run by Timer.Alarm(), which results in a stack size or memory size problem inside the ESP32. I am now running an automated test without synchronizing the device time at intervals via Timer.Alarm(). If this works, we will restart the device once a day and synchronize the time at boot, and I think we will be fine with that.
If you are able to share some code we could use to reproduce this error, we might be able to look into this. As we’ve invested a considerable amount of time into getting a stable firmware, I would be interested to investigate this further, if time permits.
Oh, you are not alone. Others might be observing the same thing.
@sita on the Pycom user forum (see above: https://forum.pycom.io/topic/5560/guru-meditation-error/13) was able to get stable runtime behavior using the Dragonfly firmware on two LoPy4 devices. However, it took him/her two attempts to do so. So, I am humbly asking if you might want to try again.
Would you be able to share more of the corresponding MicroPython code in order to reproduce the crash or to just inspect where the error might be originating from?
I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (wait 2-3 minutes; if there is no core panic, restart the device), the core panic happens again.
Guru Meditation Error: Core 1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Thanks a lot for the link. Once the Timer.Alarm callback gets called, a lot of code is indeed executed.
I don’t know enough about what is going on under the hood in detail, but the approaches in the link make sense. So in our case it may be more of an interrupt-related problem. I will try it out: the Timer.Alarm callback will only set a flag, the time synchronization will be executed in the main loop when the flag is set, and I will check whether it helps.
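Roughly, the flag approach looks like the sketch below. The class and method names (`DeferredTimeSync`, `on_alarm`, `poll`) are made up for illustration, and `sync_time` stands for whatever function performs the actual time synchronization.

```python
class DeferredTimeSync:
    """Run the time synchronization in the main loop, not in the alarm callback.

    Hypothetical helper for illustration; `sync_time` is any callable that
    performs the actual synchronization.
    """

    def __init__(self, sync_time):
        self._sync_time = sync_time
        self._pending = False

    def on_alarm(self, alarm):
        # Alarm callback: do as little as possible, just remember the request.
        self._pending = True

    def poll(self):
        # Called from the main loop: run the heavy work here if requested.
        if not self._pending:
            return False
        self._pending = False
        self._sync_time()
        return True
```

On the device this would be wired up with something like `Timer.Alarm(sync.on_alarm, 3600, periodic=True)` (with `Timer` imported from `machine`) and a `sync.poll()` call in every main loop iteration. The one hour interval is only an example value.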
For your information, I have modified the code to only set a flag in the Timer.Alarm callback and moved the logic to the main thread. Because the core dump is still thrown, we can exclude a problem related to interrupts. The next thing I can imagine is the main thread and the worker thread for sending events running in parallel. Maybe it is a stack size problem. I have checked the memory, and it seems to be fine.
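One way to test the stack size idea is to start the worker thread with an explicitly chosen stack size via `_thread.stack_size()`. The sketch below assumes the worker is started through `_thread`; `start_worker` is a made-up helper name, and a suitable `stack_bytes` value is not known in advance, so it has to be found by experiment on the device.

```python
import _thread


def start_worker(target, args, stack_bytes):
    """Start `target(*args)` in a new thread with a specific stack size.

    Hypothetical helper for illustration; `stack_bytes` is a value to
    experiment with, not a recommended setting.
    """
    # stack_size() applies to threads created afterwards and returns the
    # previously configured value.
    previous = _thread.stack_size(stack_bytes)
    try:
        return _thread.start_new_thread(target, args)
    finally:
        # Restore the previous setting so other threads are unaffected.
        _thread.stack_size(previous)
```

If the crashes disappear or change once the worker has a larger stack, that would point to the stack; if nothing changes, the stack size is probably not the cause.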