thanks a lot for the Dragonfly firmware. I have tested LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire with our MicroPython code.
As described by you, the file upload in visual studio has worked very good with your package.
Unfortunately our random core panics still occurs.
They are always happening during a Lora socket read/write/ maybe it is also in the nvram_save function. But only if the device is busy adding events and in parallel via Timer.Alarm a threat is started to synchronize the device time using the following Lora Code.
def sendPayload(self, data, updateTime = True):
try:
with self.socketLock:
self.log("> sending", len(data), "bytes:", binascii.hexlify(data))
responseData = None
# create a LoRa socket
s = socket.socket(socket.AF_LORA, socket.SOCK_RAW)
try:
# data rate: 5=SF7 (fastest), 4=SF8, 3=SF9 (longest airtime)
#s.setsockopt(socket.SOL_LORA, socket.SO_DR, 3)
# send the data
s.setblocking(True)
#bla = 4 / 0
s.send(data)
s.setblocking(False)
# read downlink data
responseData = s.recv(64)
except Exception as e:
self.log("ERROR", "LORA Socket Exception", e)
s.close()
if responseData != None:
responseLen = len(responseData)
if responseLen > 0:
self.log("< received", responseLen, "bytes:", binascii.hexlify(responseData))
else:
self.log("< no downlink")
# log
if updateTime == True:
self.lastUplinkTime = time.time()
self.log(self.stats())
time.sleep_ms(10)
# save frame counters
self.lora.nvram_save()
time.sleep_ms(5)
return responseData
except Exception as e:
self.log("ERROR", "Unable to send payload", e.args[0], e)
return False
Normally during a normal device setup, it takes some days till the core panic occur.
Time gets synchronized each hour, and from time to time a event gets added to the device via badging.
when making automated tests with starting the Time.Alarm each minute and in parallel adding events each minute (writing to a file), it crashes within minutes.
I really don’t know how to go forward with this, because the core panic freezes the devices. If it would restart as desired we could live with it.
The watchdog is activated yet seems to be broken. After 1 minute or longer, the device is not restarting. It is very strange, because we had similar situations earlier this year were 2 threads were running and once one ended in a core panic the device restarted successfully.
But now with 1.20.1.r1 it is no longer restarting and the device seems to be completely freezed.
It seems related to the Lora Modul / ESP32 and Timer.Alarm(), because our Firmware supports the exact same functionality via WLAN.
When the device is in WLAN Mode there are no Core Panics, because the Time Synchronization doesn’t need to be run separately via Timer.Alarm(), because Request/Response are straight forward.
I have attached the log of the Core Panic. It is always the same output. From the log it looks like that even the printing breaks suddenly, and then the esp32 is in a state, were the watchdog / restart doesn’t work.
I had once the idea, to disable the print in the esp32, that without the printing of the exception the device would maybe restart.
However, I was not able to recompile the esp32 part with the pycom as described in following post:
I think you do similar things for the dragonfly package ?
With kind regards,
Pascal.
Here is the log output
TRACE [LORA] Preparing to send CMD = 2 , SEQ_NO = 23
TRACE [LORA] CMD 0x01 [NFC_DETECTED] SEQ# 23 . uid = 02000000 , ts = 1577980530
TRACE [LORA] > sending 11 bytes: b'01170072120e5e02000000'
TRACE [LORA] < no downlink
TRACE [LORA] (rx_timestamp=148130812, rssi=-96, snr=7.0, sfrx=5, sftx=5, tx_trials=0, tx_power=0, tx_time_on_air=62, tx_counter=16186, tx_frequency=867300000)
TRACE [EventSender] pending events: True
TRACE [EventSender] Publishing event # 24 with CMD 2
TRACE [EventSender] Handling event # 24 with CMD 2
TRACE [LORA] Preparing to send CMD = 2 , SEQ_NO = 24
TRACE [LORA] CMD 0x01 [NFC_DETECTED] SEQ# 24 . uid = 03000000 , ts = 1577980531
TRACE [LORA] > sending 11 bytes: b'01180073120e5e03000000'
TRACE [Clock] Starting time sync
TRACE [LORA] CMD 0x04 [TIME_REQUEST] ID# 1 . our_time = 1577980589 (2020, 1, 2, 15, 56, 29, 3, 2)
TRACE [LORA] < no downlink
TRACE [LORA] (rx_timestamp=148130812, rssi=-96, snr=7.0, sfrx=5, sftx=5, tx_trials=0, tx_power=0, tx_time_on_air=62, tx_counter=16187, tx_frequency=868300000)
TRACE [LORA] > sending 7 bytes: b'040100ad120e5e'
Guru Meditation Error: Core 1 panic'ed (Cache disabled but cached memory region accessed)
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
......
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core 1 panic'ed (IllegalInstruction). Exception was unhandled.
Memory dump at 0x402042e0: bad00bad bad00bad bad00bad
Guru Meditation Error: Core�Previou
Thanks for your response.
Unfortunately there is no core dump printed.
All crashes just result in above messages.
Is there a possibility to extend the log to print the desired dump?
Thanks a lot for your message and all your investment trying to help solving this problem.
I am open to help in any way to also improve the long term stability.
Do you see any chance I can support with these two approaches:
Debugging the device? Is there any profound way to do this on my own?
Recompile the pycom firmware and esp32 code with
CONFIG_ESP32_PANIC_SILENT_REBOOT
instead of
CONFIG_ESP32_PANIC_PRINT_REBOOT
to check whether the device restarts, when no message is printed? I know that this is not a solution, because the actual cause needs to be fixed. For us it would just be not so dramatic if the device restarts properly when a core panic happens.
In my setup using the current Terkin version from the master branch in LoRaWAN mode and sending payloads every 5 minutes I’ve not seen any crashes since I put the LoPy4-1.20.1.r1-0.7.0-vanilla-dragonfly-onewire firmware onto the device a couple of weeks ago. Terkin nowadays also uses the nvram_save/nvram_restore capabilities and everything works as expected and super stable.
Please try giving the device some rest after the s.send call before invoking s.recv, e.g. 5 seconds. Also, setblocking is always set to False in our Terkin routines. I remember people running into trouble with a blocking socket.
Thank you so much for your suggestion. I have tried it out. Unfortunately the device is still crashing.
I think now, it is more related to a second task run by Timer.Alarm(), which results in a Stack Size or Memory Size problem inside the ESP32. I make now an automated test without synchronizing the Device Time in intervals with Timer.Alarm(). If this works, we restart the device once a day, synchronize the time at boot and I think we are kind of fine.
If you are able to share some code we could use to reproduce this error, we might be able to look into this. As we’ve invested a considerable amount of time into getting a stable firmware, I would be interested to investigate this further, if time permits.
Oh, you are not alone. Others might be observing the same thing.
@sita on the Pycom user forum (see above: https://forum.pycom.io/topic/5560/guru-meditation-error/13) was able to get stable runtime behavior using the Dragonfly firmware on two LoPy4 devices. However, it took him/her two attempts to do so. So, I am humbly asking if you might want to try again.
Would you be able to share more of the corresponding MicroPython code in order to reproduce the crash or to just inspect where the error might be originating from?
I have tested again with (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1-0.7.0-vanilla-dragonfly-onewire-i2s', version='daf40f36-dirty on 2019-12-04', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
After the third round (waiting 2-3 minutes, if no corepanic, then restart device)
the core panic happens again.